Automated feature generation for machine learning application

ABSTRACT

Various implementations include approaches for automating feature generation. The underlying intellectual paradigm to the approach is ensemble learning. That is, each generated feature is an element in an ensemble. Ensemble learning is a very successful paradigm in classical machine learning and dominates real-world predictive analytics projects through tools such as xgboost ( . . . ) or lightgbm ( . . . ). It is also appropriate, because of its ease-of-use compared to other successful paradigms such as deep learning. Moreover, it is possible to generate human-readable SQL code, which is very difficult with deep learning approaches. The various implementations described herein provide for increased scalability and efficiency as compared with conventional approaches.

BACKGROUND

Technical Field

The present disclosure relates to machine learning. More specifically, the present disclosure relates to processes for generating features for machine learning applications in an efficient, effective manner.

Related Art

Feature engineering takes up a major portion of the time in real-world data science projects. Moreover, expert time is by far the most expensive resource in such projects. In order to understand the importance of the problem, consider the following example: A data scientist working for a bank wants to predict which bank customers will close their account. The data scientist is given several tables: one of them contains a list of all accounts, when they were opened and when they were closed, if at all. Another table contains a list of all transactions by a certain account.

The data scientist is likely to begin by developing hypotheses: for instance, that an account that has conducted many transactions in the last 90 days is less likely to be closed than those accounts with fewer transactions in that period. However, this is only one feature. If the data scientist wants to develop a successful predictive model, he must develop hundreds of features like this one. He will spend most of his time developing hypotheses and hand-crafting features. That leaves little time to test and develop the predictive model, making the process untenable in many circumstances.

SUMMARY

All examples and features mentioned below can be combined in any technically possible way.

Various implementations include approaches for automating feature generation. The underlying intellectual paradigm to the approach is ensemble learning. That is, each generated feature is an element in an ensemble. Ensemble learning is a very successful paradigm in classical machine learning and dominates real-world predictive analytics projects through tools such as xgboost ( . . . ) or lightgbm ( . . . ). It is also appropriate, because of its ease-of-use compared to other successful paradigms such as deep learning. Moreover, it is possible to generate human-readable SQL code, which is very difficult with deep learning approaches. The various implementations described herein provide for increased scalability and efficiency as compared with conventional approaches.

Some particular aspects include a computer-implemented method of generating a feature for a machine-learning application, the method including: a) identifying a preliminary feature comprising at least one aggregation function that aggregates relational sample data or learnable weights applied to the sample data; b) calculating aggregation results with the aggregation function; c) determining, based upon an optimization criterion formula, a quality with which the aggregation results relate to target values; d) adjusting the preliminary feature by incrementally changing a condition applied to the aggregation function; e) calculating aggregation results of the adjusted preliminary feature by adjusting the aggregation results from a previous preliminary feature; f) determining a quality with which the aggregation results of the adjusted preliminary feature relate to the target values; and g) repeating processes (d) through (f) for a plurality of incremental changes, and selecting a feature that yields the best quality.

In additional particular aspects, a system includes: a computing device having a processor and a memory, the computing device configured to generate a feature for a machine-learning application by performing processes including: a) identifying a preliminary feature comprising at least one aggregation function that aggregates relational sample data or learnable weights applied to the sample data; b) calculating aggregation results with the aggregation function; c) determining, based upon an optimization criterion formula, a quality with which the aggregation results relate to target values; d) adjusting the preliminary feature by incrementally changing a condition applied to the aggregation function; e) calculating aggregation results of the adjusted preliminary feature by adjusting the aggregation results from a previous preliminary feature; f) determining a quality with which the aggregation results of the adjusted preliminary feature relate to the target values; and g) repeating processes (d) through (f) for a plurality of incremental changes, and selecting a feature that yields the best quality.

Other particular aspects include a computer program product stored on a non-transitory computer readable medium, which when executed by a computing device, causes the computing device to generate a feature for a machine-learning application by performing processes including: a) identifying a preliminary feature comprising at least one aggregation function that aggregates relational sample data or learnable weights applied to the sample data; b) calculating aggregation results with the aggregation function; c) determining, based upon an optimization criterion formula, a quality with which the aggregation results relate to target values; d) adjusting the preliminary feature by incrementally changing a condition applied to the aggregation function; e) calculating aggregation results of the adjusted preliminary feature by adjusting the aggregation results from a previous preliminary feature; f) determining a quality with which the aggregation results of the adjusted preliminary feature relate to the target values; and g) repeating processes (d) through (f) for a plurality of incremental changes, and selecting a feature that yields the best quality.

Still further aspects include a computer-implemented machine-learning method comprising: using one or more features determined and selected with the computer-implemented method of generating a feature for a machine-learning application, to calculate aggregation results from sample data that is at least partially included in one or more peripheral tables, joining the calculated aggregation results to a population table which includes target values; and training a machine-learning algorithm based on the aggregation results and the target values in the population table.

Implementations may include one of the following features, or any combination thereof.

In certain cases, the condition affects which sample data is to be aggregated with the aggregation function.

In particular aspects, process (g) includes repeating processes (d) through (f) at least twenty times, and each time changing the applied condition such that the aggregation function aggregates a larger share of the sample data or each time a smaller share of the sample data.

In some implementations, the condition splits the sample data into different groups associated with different learnable weights.

In some cases, each group of sample data has one weight.

In certain aspects, instead of aggregating the values of sample data within one group, their respective weight is aggregated.

In particular cases, process (e) includes: incrementally adjusting the aggregation results from the previous preliminary feature.

In certain aspects, the method further includes: (h) outputting the adjusted feature that yields the best quality for use in the machine-learning application.

In some cases, the method further includes incrementally repeating processes (a) through (g) for a plurality of additional conditions, wherein the adjusted feature becomes the preliminary feature during the repeating.

In particular aspects, calculating the result of the changed aggregation function includes using only a difference between the result from a current aggregation function and the result from a previous aggregation function.

In certain cases, outputting the adjusted feature is performed for only one of the adjusted features that has the best quality.

In some implementations, quality is defined by the optimization criterion formula.

In particular cases, the optimization criterion formula defines quality by maximizing (e.g., R-squared) or minimizing (e.g., squared loss, log loss).

In particular aspects, each feature is attributed to a single aggregation function.

In certain implementations, the method further includes: performing processes (a) through (g) using a distinct preliminary feature; calculating a pseudo-residual for the adjusted feature; and training the distinct preliminary feature to predict an error from the adjusted feature.

In some cases, the method further includes identifying which part of the sample data is affected by the presently applied incremental change of the condition, with the help of a match change identification algorithm, wherein process (e) of calculating the aggregation results of the adjusted preliminary feature comprises: adjusting the aggregation result from the previous preliminary feature by calculating how the sample data identified by the match change identification algorithm changes the aggregation results from the previous preliminary feature.

In particular implementations, the condition defines a threshold for sample data, and only sample data on a specific side of the threshold is to be aggregated with the aggregation function, or the condition defines a threshold for additional data associated with the sample data, and only sample data for which the associated additional data is on a specific side of the threshold is to be aggregated with the aggregation function, or the condition defines a categorical value and only sample data to which the categorical value of the condition is attributed is to be aggregated with the aggregation function.

In some cases, the method further includes: after selecting the feature that yields the best quality in process (g): (h) adding a further condition to the aggregation function and repeating processes (d) through (g) to determine the feature with the further condition that yields the best quality; and repeating process (h) to add still further conditions until a stop algorithm determines that no more conditions are to be applied to the aggregation function.

In certain implementations, the method further includes: using an optimization criterion update formula that describes how a change in the aggregation result of the aggregation function changes a previously calculated quality calculated with the optimization criterion formula, wherein process (f) of determining the quality for adjusted preliminary features comprises: determining with the optimization criterion update formula how the change in the aggregation result affects the quality, without using the optimization criterion formula for calculating the quality for adjusted preliminary features.

In some aspects, the relational sample data comprises sample data sets in one or more peripheral tables, wherein the target values are included in a population table, wherein calculated aggregation results are inserted into the population table, wherein the optimization criterion formula uses values from the population table but not from any peripheral table.

In particular cases, the preliminary feature comprises at least a first and second aggregation functions, the second aggregation function calculates an aggregation result from aggregation results of the first aggregation function, and one or more conditions are applied to at least one of the aggregation functions and processes (d) to (g) are carried out with respect to the one or more conditions.

Some implementations use one or more features that have been determined and selected to calculate aggregation results from sample data that is at least partially included in one or more peripheral tables. The calculated aggregation results are then joined to a population table which includes target values. A machine-learning algorithm is now trained based on the aggregation results and the target values in the population table.

In some aspects, the method further includes: calculating learnable weights to be aggregated by the aggregation function.

Two or more features described in this disclosure, including those described in this summary section, may be combined to form implementations not specifically described herein.

The details of one or more implementations are set forth in the accompanying drawings and the description below. Other features, objects and benefits will be apparent from the description and drawings, and from the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 depicts an illustrative environment including a computer system with a feature identification engine according to embodiments of the disclosure.

FIG. 2 is a flow diagram illustrating processes in an approach for generating features for machine learning applications according to embodiments of the disclosure.

FIG. 3 illustrates an example decision tree for performing functions according to various embodiments of the disclosure.

FIGS. 4-21 illustrate example pseudocode for use in feature identification according to various embodiments of the disclosure.

FIGS. 22-25 are flow diagrams illustrating processes in an approach for generating features for machine learning applications according to embodiments of the disclosure.

It is noted that the drawings of the disclosure are not necessarily to scale. The drawings are intended to depict only typical aspects of the invention, and therefore should not be considered as limiting the scope of the invention. In the drawings, like numbering represents like elements between the drawings.

DETAILED DESCRIPTION

In the following description, reference is made to the accompanying drawings that form a part thereof, and in which is shown by way of illustration specific illustrative embodiments in which the present teachings may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the present teachings, and it is to be understood that other embodiments may be used and that changes may be made without departing from the scope of the present teachings. The following description is, therefore, merely illustrative.

Relational sample data sets usually require that an aggregation function, such as a SUM or AVERAGE function, calculates one feature value per sample data set. When sample data sets are used for training, a target value for each feature value may be given. The feature values are then used in machine learning applications to make predictions that should be close to the target values. The resulting quality is strongly affected by the choice of how to aggregate a sample data set. Conditions may be applied to change how sample data is aggregated. For each condition, a respective quality can be calculated. However, testing a large number of different conditions, e.g., thousands of conditions, consumes much computation power. The present disclosure tackles this problem by using incremental changes in the conditions and then calculating the effects of the incremental changes instead of redoing all calculations for each condition change independently. A small incremental change of a condition typically affects only a very small share of the sample data. Therefore, aggregation results and/or a quality for a changed condition can be calculated by using previous calculation results (which were calculated for another condition) and just adjusting these results for the impact of the small share of sample data affected by the present incremental change of the condition. This makes it possible to test a huge number of different conditions and find the condition that yields the best quality for predicting target values. Conventional ways of testing conditions place high demands on calculation power, which strongly restricted the number of conditions that could be analyzed. Hence, data analysts conventionally spent much time selecting conditions, whereas the approach presented here makes it possible to find optimal conditions automatically. To achieve this, it is a main concept of this disclosure to change a condition in small incremental steps and calculate which effects this incremental change has.
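The incremental idea described above can be illustrated with a minimal Python sketch. It assumes a single sample data set, a SUM aggregation, and a condition of the form "value <= threshold"; the function name sums_for_thresholds and the example values are hypothetical and are not taken from the disclosure. Each time the threshold is raised, only the values newly activated by that step are added to the previous aggregation result, instead of recomputing the sum from scratch.

```python
def sums_for_thresholds(values, thresholds):
    """Return (threshold, SUM of values <= threshold) for each threshold.

    Thresholds are visited in ascending order, so each result is obtained
    by adjusting the previous result rather than re-aggregating all values.
    """
    ordered = sorted(values)
    results = []
    running_sum = 0.0
    pos = 0  # number of values already activated by earlier thresholds
    for t in sorted(thresholds):
        # Touch only the values that the incremental change newly activates.
        while pos < len(ordered) and ordered[pos] <= t:
            running_sum += ordered[pos]
            pos += 1
        results.append((t, running_sum))
    return results


values = [5.0, 1.0, 3.0, 8.0]
results = sums_for_thresholds(values, [2.0, 4.0, 9.0])
```

In this sketch the total work across all thresholds is one sort plus a single pass over the values, whereas recomputing the sum for every threshold independently would scan all values once per threshold.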

As noted herein, current approaches of generating features for machine learning applications are time consuming and ineffective. For example, conventional forecasting requires that a data analyst (or, “data scientist”) painstakingly scrutinize historical data to identify features that impact forecasting. These features are the result of querying database systems in which all interrelated data is pooled (e.g., so-called relational database systems). Then, the data scientist develops a forecasting system, using these features to train machine learning algorithms to make the predictions. For example, if a bank wants to find out how many of its customers want to move to another bank in the coming quarter, the target variable is the likelihood that a customer will migrate. One possible feature would be the number of banking transactions a customer has made in the past X number of days (e.g., 90 days). As a rule, hundreds of significantly more complex features are needed to develop a meaningful and accurate forecasting system; they are the basis on which the forecasting system can learn to make predictions.

The quality of the features and their selection thus determines the quality of the forecasting system; however, often the data scientist does not have time to identify all of the features that are important in predicting the target variable. The algorithm then has too few, or insufficiently descriptive, features available, resulting in poor predictive quality. Conventionally, the only way to address this issue is for the data scientist, in close consultation with domain experts (e.g., a customer manager(s)), to track down vulnerabilities in the forecasting system over many hours, correct overvalued traits, identify previously overlooked traits, and judge their impact on the target variable. This process can be extremely time-consuming and nerve-wracking, and can misuse valuable intellectual resources.

In the case of non-relational data structures, algorithms that require very little problem-specific expertise on the part of the data scientist to develop forecasting systems currently exist. For example, neural networks form their own class of algorithms, which are able to independently find feature cascades in pictures, texts and time series, and derive forecasts from those inputs. Neural networks are well suited to analyzing non-relational data structures if they are presented with many different input examples, e.g., example images of a face to learn to recognize faces in yet unknown images. These non-relational scenarios do not require that the data scientist explain to the algorithm what the features of a face look like, because in this example, program queries can check an image to determine whether the eyes, mouth, nose, etc. can be identified in a specific image detail.

In contrast to conventional approaches, various aspects of the disclosure include approaches for efficiently generating a feature for at least one condition in a machine-learning application, as well as aggregating learnable weights. These aspects of the disclosure include a feature identification engine including a statistical algorithm that almost completely automates the development of artificially intelligent systems in terms of relational data structures. Using the feature identification engine, the data scientist no longer has to extract features in painstaking and error-prone work, but merely gives the software an indication of which objective the algorithm should anticipate (e.g., customer migration or demand for goods or services), and where and in what form the raw data is stored. With this foundation, the feature identification engine automatically finds the most desirable (e.g., optimal) features. In the customer retention example described above, this feature identification system can be configured to independently determine whether the number of transactions in the last 90 days or, for example, in the last 95.5 days is the better feature for determining the probability that customers terminate their contracts.

Introduction and General Definitions

Embodiments of the disclosure can use techniques for generating a feature for one or more conditions in a machine-learning application. Embodiments of the disclosure include systems, computer program products and methods employing a feature generation engine.

To better illustrate the various embodiments of the present disclosure, particular terminology which may be known or unknown to those of ordinary skill in the art is defined to further clarify the embodiments set forth herein. The term “system” can refer to a computer system, server, etc., composed wholly or partially of hardware and/or software components, one or more instances of a system embodied in software and accessible to a local or remote user, all or part of one or more systems in a cloud computing environment, one or more physical and/or virtual machines accessed via the Internet, other types of physical or virtual computing devices, and/or components thereof.

The term “population table” refers to the main table of the problem to be solved. It defines the statistical population and contains the target values. For a given problem, there is only one population table. The problem also contains at least one peripheral table. A peripheral table contains data that needs to be aggregated before it can be joined onto the population table using a join key. A peripheral table does not typically contain target values, which are values to be predicted (e.g., one or more columns in the population table). However, in some cases, the peripheral table (or tables) is identical to the population table (also known as a “self-join” scenario). This can occur in time series applications, such as sales prediction (e.g., predicting tomorrow's sales based upon average sales for that day of the week from a data set). A match is a combination of two samples in two different tables that have the same join key. The value to be aggregated is determined by the match and other circumstances that are defined in advance. For instance, the value to be aggregated can be a column in the peripheral table, or several columns in the population table and the peripheral table that are combined by some predefined mathematical equation, or a learnable weight. For example, a shop may have a “normal price” and an “actual price”, where the actual price is the amount that a customer actually paid for an item. The actual price may differ from the normal price because the customer has a coupon or other discount voucher. As noted herein, approaches can account for differences between such values (e.g., normal v. actual price) as a value to be aggregated. An activated match is included in the aggregation. By contrast, a deactivated match is ignored by the aggregation. A condition is applied to the aggregation, e.g., after the WHERE keyword in SQL. Conditions impact the activation status of a match.
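The terminology above can be made concrete with a small Python sketch. The table contents, the column names (account_id, closed, days_ago) and the function name count_activated_matches are illustrative assumptions that loosely follow the bank example; the disclosure does not prescribe this data layout. A match pairs a population row with a peripheral row that shares the join key, the condition decides whether the match is activated, and a COUNT aggregation over the activated matches yields one feature value per population row.

```python
from collections import defaultdict

# Population table: one row per account, including the target value.
population = [
    {"account_id": 1, "closed": 0},
    {"account_id": 2, "closed": 1},
]

# Peripheral table: one row per transaction, linked by the join key.
peripheral = [
    {"account_id": 1, "days_ago": 10},
    {"account_id": 1, "days_ago": 45},
    {"account_id": 1, "days_ago": 120},
    {"account_id": 2, "days_ago": 200},
]


def count_activated_matches(population, peripheral, max_days_ago):
    """COUNT, per join key, the matches for which the condition holds."""
    counts = defaultdict(int)
    for row in peripheral:
        # Condition, comparable to "WHERE days_ago <= max_days_ago" in SQL.
        if row["days_ago"] <= max_days_ago:
            counts[row["account_id"]] += 1  # activated match
        # Otherwise the match is deactivated and ignored by the aggregation.
    # Join the aggregation results onto the population table.
    return [dict(p, feature=counts[p["account_id"]]) for p in population]


features = count_activated_matches(population, peripheral, max_days_ago=90)
```

With a 90-day condition, account 1 receives the feature value 2 and account 2 receives 0; the feature column then sits next to the target column in the population table.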

An aggregation function may be understood as a function that calculates one or more values (also referred to as aggregation results) from a data set, also described as sample data or sample data set. Examples of an aggregation function are: SUM (which outputs the sum of sample data); AVG (which outputs an average or arithmetic average of sample data); MIN and MAX (which output the minimal and maximal value of the sample data, respectively); MEDIAN (which outputs the median value of the sample data); COUNT (which outputs the number of items included in the sample data); VAR (which outputs the variance of the sample data); STDDEV (which outputs a standard deviation of the sample data). More generally, an aggregation function may be understood as a function that calculates one or more values from a data set, wherein the number of calculated values is smaller than the number of values included in the sample data set.

A sample data set on which the aggregation function is applied may be initially stored in any tables or it may be calculated from stored tables. For example, the differences between two columns may first be calculated and these differences may form one sample data set from which the aggregation function calculates a feature value.

A condition may be understood as an instruction on how to select specific data from the sample data. The condition may thus define which of the sample data is aggregated with the aggregation function and which sample data is ignored. If there are several conditions for one aggregation function, a combination of TRUE and NOT TRUE results from the conditions may define whether or not sample data is aggregated by the aggregation function. Data that is selected through the conditions to be aggregated with the aggregation function is also referred to as an active match or being activated; whereas data that is judged by the conditions as not to be aggregated is also referred to as an inactive match or as being deactivated.

In some embodiments, the conditions define groups and each sample data value within the same group is assigned the same weight, whereas sample data values in other groups are assigned other weights. Even when weights are used, conditions can define that, if one or more specific conditions are met/not met, the corresponding sample data values are deactivated and not included in the aggregation.
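As a rough illustration of the weighted variant, the following Python sketch assumes one threshold condition that splits a sample data set into two groups, each with a single weight, and a SUM aggregation that aggregates the weights rather than the sample values. The function name sum_of_group_weights, the optional deactivation condition exclude_above, and the numbers are hypothetical; how the weights are learned is not shown here.

```python
def sum_of_group_weights(values, threshold, weight_low, weight_high,
                         exclude_above=None):
    """SUM the group weights assigned to the sample values.

    Values <= threshold fall into the "low" group, all others into the
    "high" group. Values above exclude_above (if given) are deactivated.
    """
    total = 0.0
    for v in values:
        if exclude_above is not None and v > exclude_above:
            continue  # deactivated: not included in the aggregation
        # Aggregate the weight of the value's group, not the value itself.
        total += weight_low if v <= threshold else weight_high
    return total


sample = [1.0, 2.0, 5.0, 9.0]
all_active = sum_of_group_weights(sample, 3.0, 0.5, 2.0)
some_inactive = sum_of_group_weights(sample, 3.0, 0.5, 2.0, exclude_above=8.0)
```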

The optimization criterion formula may comprise one or more formulas that describe how well the aggregation results of a feature relate to the target values. For example, a prediction model may be used, such as a linear regression, with the aggregation results as values of the/an independent variable to predict the target values, i.e. to determine coefficient(s) of the prediction model that describe how aggregation results relate to target values. Discrepancies between the aggregation results and results from the prediction model may be used by the optimization criterion formula to calculate the quality. In particular, the optimization criterion formula may depend on residuals or pseudo-residuals. Examples of the optimization criterion formula are formulas to calculate an R-squared value (a coefficient of determination), e.g., the proportion of the variance in the target values that is explained by the prediction. Another example is the squared loss function, which calculates the sum of squared differences between the prediction and the target. Similarly, other loss functions may be used.
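A minimal Python sketch of two such optimization criteria follows. It assumes a simple one-variable linear regression fitted by ordinary least squares, with the aggregation results as the independent variable; the function names fit_line, squared_loss and r_squared are hypothetical, and the disclosure does not commit to this particular prediction model.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx else 0.0  # constant feature: flat line
    intercept = mean_y - slope * mean_x
    return slope, intercept


def squared_loss(x, y):
    """Sum of squared differences between prediction and target (lower is better)."""
    slope, intercept = fit_line(x, y)
    return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))


def r_squared(x, y):
    """Share of the target variance explained by the prediction (higher is better)."""
    mean_y = sum(y) / len(y)
    total = sum((yi - mean_y) ** 2 for yi in y)
    if total == 0:
        raise ValueError("R-squared is undefined for constant target values")
    return 1.0 - squared_loss(x, y) / total


aggregation_results = [1.0, 2.0, 3.0]
targets = [1.0, 3.0, 2.0]
loss = squared_loss(aggregation_results, targets)
quality = r_squared(aggregation_results, targets)
```

Both functions take the aggregation results and the target values as their only inputs, which matches the idea that the quality is evaluated on the population table once the aggregation results have been joined onto it.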

A feature may be understood as a mathematical expression comprising an aggregation function to calculate an aggregation value from sample data (also referred to as sample i). The aggregation value is also referred to as the value of the feature for sample i and denoted as ∇f_(t,i).

In the present disclosure, indices may be written as subscript or regular characters, without any difference in the meaning being intended. For example, w0 and w₀ may be used interchangeably. The expressions “optimization criterion formula”, “optimization criterion function” and “optimization criterion” may be understood as synonyms.

Computer System and Example Components

Turning now to FIG. 1, an illustrative environment 100 for implementing the methods and/or systems described herein is shown. In particular, a computer system 102 is shown as including a computing device 104. Computing device 104 can include, e.g., a feature identification engine 106 which may include, e.g., one or more sub-systems (feature identification algorithm 120 described herein) for performing any/all of the processes described herein and implementing any/all of the embodiments described herein.

Computer system 102 is shown including a processing unit (PU) 108 (e.g., one or more processors), an I/O component 110, a memory 112 (e.g., a storage hierarchy), an external storage system 114, an input/output (I/O) device 116 (e.g., one or more I/O interfaces and/or devices), and a communications pathway 118. In general, processing unit 108 can execute program code, such as feature identification engine 106, which is at least partially fixed in memory 112. While executing program code, processing unit 108 can process data, which can result in reading and/or writing data from/to memory 112 and/or storage system 114. Pathway 118 provides a communications link between each of the components in environment 100. I/O component 110 can comprise one or more human I/O devices, which enable a human user to interact with computer system 102 and/or one or more communications devices to enable a system user to communicate with the computer system 102 using any type of communications link. To this extent, feature identification engine 106 can manage a set of interfaces (e.g., graphical user interface(s), application program interface(s), etc.) that enable system users to interact with feature identification engine 106. Further, feature identification engine 106 can manage (e.g., store, retrieve, create, manipulate, organize, present, etc.) data, through several modules contained within or accessible via the feature identification algorithm 120 (i.e., modules 122). Feature identification algorithm 120 is shown by example as being a sub-component of feature identification engine 106. However, it is understood that feature identification algorithm 120 may be a wholly independent system or set of instructions (e.g., logic stored in a separate location and/or executed on a distinct computing device).

The computer system 102 is programmed to perform a set of tasks used by the feature identification engine 106. In some cases, memory 112 storing the feature identification engine 106 and the feature ID algorithm 120 can include various software modules 122, configured to perform different actions. Example modules can include, e.g., a comparator, a calculator, a determinator, etc. One or more modules can use algorithm-based calculations, look up tables, software code, and/or similar tools stored in memory 112 for processing, analyzing, and operating on data to perform their respective functions. Each module discussed herein can obtain and/or operate on data 130 from exterior components, units, systems, etc., or from memory 112 of computing device 104.

In some cases, the modules 122 include one or more statistical predictive models, which can include multiple layers of models, calculations, etc., each including one or more adjustable calculations, logical determinations, etc., through any currently-known or later developed analytical technique for predicting an outcome based on raw data. As described herein, the feature identification engine 106 can be configured to use data 130, including feature data, condition data, optimization criteria data, target value data, aggregation function data, etc. These data 130 can be used as inputs to further adjust logic in the feature ID algorithm 120 as discussed herein. Modules 122 can implement one or more mathematical calculations and/or processes, e.g., to execute the machine learning and/or analysis functions of the feature identification engine 106.

Where computer system 102 comprises multiple computing devices, each computing device may have only a portion of feature identification engine 106 fixed thereon (e.g., one or more modules 122). However, it is understood that computer system 102 is only representative of various possible equivalent computer systems that may perform a process described herein. Computer system 102 can obtain or provide data, such as data stored in memory 112 or storage system 114, using any solution. For example, computer system 102 can generate and/or be used to generate data from one or more data stores, receive data from another system, send data to another system, etc.

Using Feature Identification Engine to Generate a Feature for a MachineLearning Application

FIG. 2 is a flow diagram illustrating processes in generating a featurefor a machine learning application. As shown, approaches can include:

Process 201: identifying a preliminary feature comprising an aggregation function. In particular implementations, each feature is attributed to a single aggregation function. The aggregation function may be used to aggregate relational sample data or learnable weights applied to the sample data.

Process 202: calculating aggregation results with the aggregation function. In particular implementations, the sample data comprises several sample data sets and the aggregation function is applied on each sample data set. For example, the aggregation function may be a SUM function that outputs, for each sample data set, a respective sum, referred to as an aggregation result.

Process 203: determining, based upon an optimization criterion formula, a quality with which the aggregation results relate to target values. The optimization criterion formula may, for example, comprise a loss function such as a squared loss function that uses quadratic differences to target values, or a log loss function that uses the logarithm of a respective difference to a target value. A difference may be calculated between an aggregation result and a target value. The quality may be the value output by the optimization criterion formula; in the case of a loss function, a smaller value indicates a better quality. The optimization criterion formula may also comprise a reward function or R-squared function, which again calculates a value (quality) from the aggregation results and the target values; in this case a higher value indicates a higher quality.

Process 204: adjusting the preliminary feature by incrementally changing a condition applied to the aggregation function. The condition may determine which sample data is used, or how sample data is used, for calculating the aggregation results. For example, sample data is aggregated with the aggregation function only if it meets the condition. As an example, the condition may define a threshold, and only sample data below or above the threshold will be aggregated by the aggregation function. As described elsewhere in this disclosure in greater detail, a condition is incrementally changed; in the case of a threshold, the threshold is changed by a (small) step such that the change in the condition affects only a small fraction of the sample data, in particular at most 1% or 5% of the sample data. The efficiency gains possible through incremental changes are explained in the following description.

Process 205: calculating aggregation results of the adjusted preliminary feature by adjusting the aggregation results from a previous preliminary feature. In particular implementations, an adjusted preliminary feature differs from a previous preliminary feature in that a condition is incrementally varied and affects only a small portion of the sample data. Hence, it may accelerate calculations to determine how the aggregation results from a previous preliminary feature are to be adjusted instead of calculating all aggregation results anew. For example, the changed condition may leave some sample data sets entirely unaffected. An aggregation result for such a sample data set does not need to be calculated anew; rather, the aggregation result from a previous preliminary feature may be used. Furthermore, if only a fraction of the sample entries within one sample data set are affected by the change in the condition, the previous aggregation result may be adjusted by adding an aggregation result of the fraction of sample entries. For example, the condition may be a threshold that is changed from a value x to x+1, with the result that just three more sample entries within a sample data set of, e.g., 10,000 values meet the condition and are to be included in the aggregation; the aggregation function may be a SUM function, and the sum of the three identified sample entries is then added to the previous aggregation result, i.e., the sum of the sample entries covered by the previous threshold of x. Using such previous aggregation results is particularly useful in combination with incremental changes, which typically require merely small adjustments to previous aggregation results.
Hence, a previous aggregation result may be changed merely by the effect that the sample data entries affected by the change of the condition in process 204 have on the previous aggregation result; this procedure may also be described as incrementally adjusting or updating the previous aggregation result. In particular cases, calculating the result of the changed aggregation function includes using only a difference between the result from a current aggregation function and the result from a previous aggregation function.

Process 206: determining a quality with which the aggregation results of the adjusted preliminary feature relate to the target values. This determination may be carried out as described above, using the optimization criterion function. However, it may be computationally more efficient not to perform the whole calculation of the optimization criterion function for each incremental change but rather to use and adjust previous results. In some implementations, an optimization criterion update formula may be used to calculate a quality for the present preliminary feature with the changed condition. The optimization criterion update formula may describe how the quality previously calculated for previous aggregation results is to be adjusted to account for the incremental change. In some cases, the incremental change of a condition may leave many of the aggregation results the same while just a few of the aggregation results change. The optimization criterion update formula may then reuse aggregation results from a previous preliminary feature. The optimization criterion update formula may also include mathematical expressions that describe how a changed aggregation result affects the quality. In particular, if an aggregation result changes, the impact of the previous aggregation result may be subtracted and the impact of the changed aggregation result may be added.

Process 207 (optional, in some implementations): outputting the adjusted feature for use in the machine-learning application. In certain implementations, only the adjusted feature having the best quality is output. In particular implementations, as noted herein, quality is defined by the optimization criterion, for example, by maximizing R-squared or minimizing the squared loss.

In various implementations, in loop 208, processes 204 through 206 are repeated for a plurality of incremental changes, and a feature is selected that yields the best quality. In some implementations, the quality corresponds to the result of a loss function and hence the best quality is the smallest value calculated for any of the preliminary features. In other implementations, the quality corresponds to a result of a reward function or R-squared function and hence the best quality is the largest value calculated for any of the preliminary features.

In processes 201 and 202, identifying a feature comprising an aggregation function and calculating aggregation results with the aggregation function may be performed with or without a condition being applied. If a condition is applied, the condition in process 204 may be understood as a different or varied condition. Process 202 may also use a condition such that all data is aggregated. For example, the condition may define a threshold such that all sample data is on the same side of the threshold. Process 204 then changes this threshold such that the condition applies only to a part of the sample data. Through loop 208, the threshold is successively varied further. In this way, the condition is incrementally changed, and each time a quality associated with this condition is calculated in process 206. For example, the loop 208 may be repeated twenty or more times, and each time the condition is varied such that a smaller share of the sample data is aggregated. In other implementations, process 202 may use a condition which is set such that only a small fraction of the sample data, e.g., at most 5% or 10%, is aggregated, and then process 204 changes the condition such that a larger fraction of the sample data is aggregated. Each time loop 208 repeats, the condition is changed to increase this fraction. For example, the loop 208 may be repeated twenty or more times, and each time the condition is varied such that a larger share of the sample data is aggregated.
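Processes 204 through 206 and loop 208 can be illustrated with a minimal Python sketch. It assumes a SUM aggregation, a condition of the form "condition value <= threshold", and a squared loss as the optimization criterion; the function name sweep_thresholds and the data layout are hypothetical and chosen only for illustration. Each threshold step visits only the entries that newly meet the condition, and the loss is adjusted by subtracting the old contribution of an aggregation result and adding the new one.

```python
def sweep_thresholds(groups, targets, thresholds):
    """Return (best_loss, best_threshold) for SUM(value) WHERE cond <= threshold.

    groups:  dict mapping a sample key to a list of (cond, value) entries
    targets: dict mapping the same keys to target values
    """
    # Sort all entries once by condition value, so that each threshold step
    # only has to visit the entries that newly satisfy the condition.
    entries = sorted(
        ((cond, key, value) for key, rows in groups.items() for cond, value in rows),
        key=lambda entry: entry[0],
    )
    sums = {key: 0.0 for key in groups}  # aggregation result per sample data set
    loss = sum((sums[key] - targets[key]) ** 2 for key in groups)
    best = (loss, None)  # None means that no entry is aggregated
    pos = 0
    for threshold in sorted(thresholds):
        while pos < len(entries) and entries[pos][0] <= threshold:
            _, key, value = entries[pos]
            old = sums[key]
            new = old + value  # incremental update of the SUM
            # Subtract the old impact on the loss and add the new impact.
            loss += (new - targets[key]) ** 2 - (old - targets[key]) ** 2
            sums[key] = new
            pos += 1
        if loss < best[0]:
            best = (loss, threshold)
    return best
```

With two sample data sets whose entries are {"a": [(1, 10.0), (5, 20.0)], "b": [(2, 5.0), (8, 7.0)]}, targets {"a": 10.0, "b": 5.0} and thresholds 0, 2, 6 and 9, the sketch selects the threshold 2, for which the squared loss is zero.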

A condition may refer to the sample data to be aggregated or to other data associated with the sample data. For example, the sample data may be any numerical values whereas the other data may be calendar data. Each calendar date is associated with zero, one or more entries of the sample data. A condition may then select specific calendar dates, and the aggregation function only aggregates the sample data associated with the selected calendar dates. Calendar dates are just one example; a condition may instead refer to any other data (numerical values or categorical indications) associated with the sample data.

In additional implementations, in loop 209, processes 201-206 are repeated incrementally for a plurality of additional conditions. In these cases, the adjusted feature becomes the preliminary feature during the repeating of process 201. As an example, the first condition may define an upper threshold (the value of the upper threshold being determined in process 206 as the upper threshold with which the best quality can be achieved) and an additional condition may define a lower threshold for which again the value is determined in process 206. Still further additional conditions may then define further upper and lower thresholds that form several spaced intervals of data to be aggregated.

In additional cases, the processes shown in loop 209 (repeating processes 201-206) can be performed using a distinct preliminary feature. That is, processes 201-206 can be repeated for a distinct preliminary feature on the same condition. In these cases, following repeated process 206, performed using a distinct preliminary feature, the method can further include additional processes:

(I) calculating a pseudo-residual for the adjusted feature; and

(II) training the distinct preliminary feature to predict an error from the adjusted feature.

In further implementations, process 202 does not aggregate the sample data itself, but instead learnable weights are assigned to the sample data and used as values to be aggregated. For example, initially the same weight may be assigned to all sample data, or the sample data may be split into a first group and a second group, based on an initial condition, and a first weight is assigned to the sample data in the first group and a different second weight is assigned to the sample data in the second group. The aggregation function then does not aggregate the sample data itself but the weights assigned to the sample data. The condition defines how the sample data is split into groups and thus defines the number of sample data entries per group. With the number of sample data entries per group, an algorithm may be employed to calculate the best weights (for these numbers of entries per group) to describe given target values. Each time the condition is varied, the weights are calculated anew and again a quality is determined with which the target values can be predicted, using this condition.

An implementation of the method which uses learnable weights may also be described with the following processes:

Process A) set a (preliminary) condition that splits sample data into two groups; determine the number of sample data entries per group; to assign a respective weight to each group of sample data, use a function for calculating the weights of the groups based on a given loss function, given target values, and the respective number of sample data entries in each group; use an aggregation function to calculate an aggregation result by aggregating all weights assigned to the sample data; and use an optimization criterion function to determine a quality with which the aggregation results relate to the target values;

Process B) (incrementally) change the preliminary condition to split the sample data differently into two groups and repeat process A) for each changed preliminary condition; and set the preliminary condition with the highest quality as a fixed condition;

Process C) test different preliminary subconditions that each split one of the groups set with the fixed condition into two new groups and perform process A) for each of the preliminary subconditions; set the preliminary subcondition with the highest quality as a fixed subcondition;

repeat process C) to determine and set further subconditions;

output the set conditions and subconditions, optionally together with the corresponding determined weights.

The above-noted processes are described in greater detail by way of the following examples.

Example Calculations: Pseudo-Residuals

In some cases, pseudo-residuals are calculated according to one or more equations. For example, let L( . . . ) be a loss function, $y_{i}$ be the target for sample i and ${\hat{y}}_{t - 1,i}$ be the prediction generated from all previous predictors in the ensemble. Given the current prediction, begin by calculating the pseudo-residuals. There are a number of ways this can be performed, usually based on the first and second derivatives of the loss function L( . . . ):

$\begin{matrix}{g_{i}:={\frac{\partial{L\left( {{\hat{y}}_{{t - 1},i};y_{i}} \right)}}{\partial{\hat{y}}_{{t - 1},i}}.}} & (1) \\{h_{i}:={\frac{\partial^{2}{L\left( {{\hat{y}}_{{t - 1},i};y_{i}} \right)}}{\partial{\hat{y}}_{{t - 1},i}^{2}}.}} & (2)\end{matrix}$

Negative gradients can be used as pseudo-residuals, which are denoted as $\epsilon_{i}$:

$\begin{matrix}{\epsilon_{i}:={- g_{i}}.} & (3)\end{matrix}$

An alternative way to calculate the pseudo-residuals is to use a Taylor approximation:

$\begin{matrix}{\epsilon_{i}:={- {\frac{g_{i}}{h_{i}}.}}} & (4)\end{matrix}$

Yet another approach is to use the target variables directly as pseudo-residuals. In this case, calculating an update rate is unnecessary, and differences between the features arise from randomness generated through bootstrapping or other means:

$\begin{matrix}{\epsilon_{i}:=y_{i}.} & (5)\end{matrix}$
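Equations (1) through (5) can be illustrated with a short Python sketch. The two loss functions below are assumptions made for illustration only: a squared loss of the form 0.5*(ŷ − y)² and a log loss in which ŷ is treated as a raw score passed through a sigmoid. All function names are hypothetical.

```python
import math


def squared_loss_derivatives(y_hat, y):
    # Assumed loss: L = 0.5 * (y_hat - y)**2, so g = y_hat - y and h = 1.
    return y_hat - y, 1.0


def log_loss_derivatives(y_hat, y):
    # Assumed loss: L = -(y*log(p) + (1 - y)*log(1 - p)) with p = sigmoid(y_hat).
    p = 1.0 / (1.0 + math.exp(-y_hat))
    return p - y, p * (1.0 - p)


def pseudo_residual(y_hat, y, derivatives, mode="gradient"):
    """Pseudo-residual for one sample, following equations (3), (4) or (5)."""
    g, h = derivatives(y_hat, y)
    if mode == "gradient":  # equation (3): negative gradient
        return -g
    if mode == "taylor":    # equation (4): Taylor approximation
        return -g / h
    if mode == "target":    # equation (5): the target itself
        return y
    raise ValueError(f"unknown mode: {mode}")
```

For the assumed squared loss, the gradient variant and the Taylor variant coincide because the second derivative is 1; for the assumed log loss they differ, since the Taylor variant rescales the gradient by the local curvature.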

Example Calculations: Optimization Criteria

Various optimization criteria that can be incrementally updated can be used in conjunction with the disclosed implementations. One optimization criterion that can be used is R-squared. For this approach to work, R-squared must be expressed in a way that can be incrementally updated. This is achieved as follows: Let $\nabla f_{t,i}$ be the value of a generated feature for sample i. Let I be the number of samples in the population table. As noted herein, $\epsilon_{i}$ denotes the pseudo-residual. The goal is then to manipulate the $\nabla f_{t,i}$ in order to maximize the following, referred to as R-squared:

$\begin{matrix}{\frac{\left( {{I*{\sum_{i}{{\nabla f_{t,i}}\epsilon_{i}}}} - {\left( {\sum_{i}{\nabla f_{t,i}}} \right)*\left( {\sum_{i}\epsilon_{i}} \right)}} \right)^{2}}{\left( {{\sum_{i}{\nabla f_{t,i}^{2}}} - \left( {\sum_{i}{\nabla f_{t,i}}} \right)^{2}} \right)*\left( {{\sum_{i}\epsilon_{i}^{2}} - \left( {\sum_{i}\epsilon_{i}} \right)^{2}} \right)}.} & (6)\end{matrix}$

The R-squared as expressed in equation (6) can be incrementally updated using the sums defined in equation (6a). Let

$\begin{matrix}{{S_{xy}:={\sum_{i}{{\nabla f_{t,i}}\epsilon_{i}}}},\quad{S_{x}:={\sum_{i}{\nabla f_{t,i}}}},\quad{S_{xx}:={\sum_{i}{\nabla f_{t,i}^{2}}}}.} & (6a)\end{matrix}$

When sample i has changed, the sums in equation (6a), and with them equation (6), can be incrementally updated as follows:

$\begin{matrix}{S_{xy}^{new} = {S_{xy}^{old} + {\left( {{\nabla f_{t,i}^{new}} - {\nabla f_{t,i}^{old}}} \right)\epsilon_{i}}}} & \\{S_{x}^{new} = {S_{x}^{old} + {\nabla f_{t,i}^{new}} - {\nabla f_{t,i}^{old}}}} & \\{S_{xx}^{new} = {S_{xx}^{old} + {{\nabla f_{t,i}^{new}}{\nabla f_{t,i}^{new}}} - {{\nabla f_{t,i}^{old}}{\nabla f_{t,i}^{old}}}}.} & (7)\end{matrix}$

The new R-squared can then be calculated as follows:

$\begin{matrix}{\frac{\left( {{I*S_{xy}^{new}} - {S_{x}^{new}*\left( {\sum_{i}\epsilon_{i}} \right)}} \right)^{2}}{\left( {S_{xx}^{new} - \left( S_{x}^{new} \right)^{2}} \right)*\left( {{\sum_{i}\epsilon_{i}^{2}} - \left( {\sum_{i}\epsilon_{i}} \right)^{2}} \right)}.} & (8)\end{matrix}$
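The running-sum updates of equation (7) can be sketched in Python as follows. The class name IncrementalRSquared is hypothetical. As an assumption of this sketch, the final value is computed as the textbook squared Pearson correlation, in which the sums of squares in the denominator are multiplied by the number of samples I; the sufficient statistics and their incremental updates are the ones given in equations (6a) and (7).

```python
class IncrementalRSquared:
    """Keeps S_xy, S_x and S_xx so that a single changed feature value
    can be accounted for in constant time."""

    def __init__(self, features, residuals):
        self.f = list(features)    # feature values, one per sample
        self.eps = list(residuals)  # pseudo-residuals, fixed during training
        self.n = len(self.f)       # I in the notation of the text
        self.s_y = sum(self.eps)
        self.s_yy = sum(e * e for e in self.eps)
        self.s_x = sum(self.f)
        self.s_xx = sum(v * v for v in self.f)
        self.s_xy = sum(v * e for v, e in zip(self.f, self.eps))

    def update(self, i, new_value):
        """Apply equation (7) for sample i whose feature value has changed."""
        old_value = self.f[i]
        self.s_xy += (new_value - old_value) * self.eps[i]
        self.s_x += new_value - old_value
        self.s_xx += new_value * new_value - old_value * old_value
        self.f[i] = new_value

    def value(self):
        # Assumption: standard squared Pearson correlation with factor I.
        n = self.n
        numerator = (n * self.s_xy - self.s_x * self.s_y) ** 2
        denominator = (n * self.s_xx - self.s_x ** 2) * (n * self.s_yy - self.s_y ** 2)
        return numerator / denominator if denominator > 0 else 0.0
```

Starting from feature values 1, 2, 3, 5 and pseudo-residuals 2, 4, 6, 8, changing the last feature value to 4 makes the feature an exact linear function of the pseudo-residuals, and value() returns 1 after a single constant-time update.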

Example Calculations: Condition Trees

Consider the conditions “WHERE some_column > 0.5” and “WHERE some_column > 0.6”. In all likelihood, only a small share of samples will be affected by this incremental change. It would therefore be inefficient to recalculate all of the samples. In this section, an efficient, greedy and decision-tree-like approach to solve this problem is presented. For this approach to be effective, an optimization criterion that can be incrementally updated is required. Depending on the conditions, every match can either be activated or deactivated. Aggregations should also be expressed in a form that allows for incremental updates. These incremental updates are discussed further herein. It is also possible that the path from the peripheral table to the population table contains more than just one aggregation. There can be a set of nested aggregations, such as the SUM of a COUNT or the AVG of a SUM. Therefore, it must be possible to update aggregations based not just on a change in the activation status of a match, but also on the changed values of another aggregation. This is discussed further herein. The remaining question is how to find an optimal set of conditions that will help maximize the optimization criterion. As shown herein, a greedy, tree-like model, referred to herein as a condition tree, was used in this example. An example of a condition tree is provided in FIG. 3.

In this example, all matches are activated by default. A condition is then applied that can either be true or false. If the condition is true for the match, its activation status is changed. If the condition is false, the activation status is left unchanged. The activation status also determines which condition is applied next. In embodiments where the values to be aggregated are learnable weights, the conditions additionally determine which weight to use. Once the end of the condition tree is reached, the status of the match is either activated or deactivated. All activated matches will be included in the aggregation. All deactivated matches will be ignored by the aggregation.

In some cases, condition trees can be expressed in SQL code. For example, it is possible to identify all paths through the condition tree that lead to the match being activated and express them as conditions. For instance, the condition tree shown in FIG. 3 is expressed in pseudo-SQL code in FIG. 4.
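The evaluation of a condition tree for a single match can be sketched in Python as follows. The sketch does not reproduce the tree of FIG. 3; the class ConditionNode, the function is_activated and the column names used in the usage note are hypothetical. Every match starts activated, a true condition flips the activation status, and the resulting status selects the next condition.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ConditionNode:
    condition: Callable[[dict], bool]
    # The activation status after this node decides which node comes next.
    next_if_activated: Optional["ConditionNode"] = None
    next_if_deactivated: Optional["ConditionNode"] = None


def is_activated(match: dict, root: Optional[ConditionNode]) -> bool:
    activated = True  # all matches are activated by default
    node = root
    while node is not None:
        if node.condition(match):
            activated = not activated  # a true condition changes the status
        node = node.next_if_activated if activated else node.next_if_deactivated
    return activated
```

As a usage example, a root node with the condition amount > 100 followed, on the deactivated branch, by a node with the condition age < 30 activates exactly those matches for which NOT (amount > 100) OR (amount > 100 AND age < 30) holds, which is the kind of expression that can be written into a WHERE clause.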

Example Calculations: Training Condition Trees

Condition trees are trained using a greedy algorithm. In one approach, at every node, the algorithm iterates through a set of possible conditions, changing the activation status of all matches that are affected by each condition. It then chooses the condition that yields the maximum improvement of the optimization criterion. For the training approach to work, the following functions are useful:

1. A binning function, denoted as bin_function. This function determines possible ways to split the matches, based on how this is done in conventional approaches for decision trees (e.g., as described in “Approximate splitting for ensembles of trees using histograms,” Chandrika Kamath et al., incorporated by reference herein). One way is to define a set of threshold values and then separate the matches depending on whether the corresponding entries in the matches are smaller or greater than the threshold. However, other methods are possible, such as splitting along categorical variables.

2. A function for calculating incremental updates, denoted update_incrementally. This function first identifies all matches for which the activation status needs to be changed since the last time update_incrementally or activate_all was called. After changing their activation status, it incrementally updates the aggregations. Examples for incrementally updating the aggregations are described herein. For all samples in the population table that have been affected by the changing matches, it then incrementally updates the optimization criterion. Examples for this process are provided in equation (7) and equation (8). Pseudocode for this function is provided in FIG. 5. In embodiments where the values to be aggregated are learnable weights, update_incrementally will also calculate the learnable weights.

3. A function for finding the best split, denoted find_best_split. The best split is the one that leads to the maximum improvement of the optimization criterion. This best split may be used as one condition.

4. A stopping criterion, denoted as the function satisfies_stopping_criterion. Conventional decision-tree practice offers many options for stopping criteria, such as a maximum depth, a minimum requirement on the improvement of the optimization criterion, a minimum number of samples on each new leaf, or a combination of such factors.

5. A function for updating the condition tree, denoted update_condition_tree. The function takes the best split and updates the current node of the condition tree according to the split, thus separating the matches into a set for which the activation status is changed and a set for which the activation status is not changed.

FIG. 6 shows a training algorithm for a condition tree that is used in some implementations of the invention.

Example Calculations: Train Simple Regression Model

The next step is to train a simple prediction model with the pseudo-residual $\epsilon_{i}$ as the dependent variable and the feature values $\nabla f_{t,i}$ as the independent variable. This step can be skipped if the pseudo-residuals are simply the targets or when the values to be aggregated are learnable weights. Predictions generated by the regression model can be denoted as $\nabla{\hat{y}}_{t,i}$. Examples for such a model include a linear regression or a regression tree.

Example Calculations: Calculate the Update Rate

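The final step is to find the update rate α such that it minimizes the loss $L\left( {{\hat{y}}_{{t - 1},i} + {\alpha*{\nabla{\hat{y}}_{t,i}}}};y_{i} \right)$ summed over all samples i. This can be achieved using a line search or a second-order Taylor approximation. This step can be skipped if the pseudo-residuals are simply the targets.

As in equations (1) and (2), it still remains true that:

$g_{i}:=\frac{\partial{L\left( {{\hat{y}}_{{t - 1},i};y_{i}} \right)}}{\partial{\hat{y}}_{{t - 1},i}}$ and $h_{i}:=\frac{\partial^{2}{L\left( {{\hat{y}}_{{t - 1},i};y_{i}} \right)}}{\partial{\hat{y}}_{{t - 1},i}^{2}}$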

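Setting the derivative of the second-order Taylor approximation of the summed loss with respect to α to zero, the update rate can then be calculated as follows:

$\begin{matrix}{\alpha:={- {\frac{\sum_{i}{g_{i}{\nabla{\hat{y}}_{t,i}}}}{\sum_{i}{h_{i}{\nabla{\hat{y}}_{t,i}}{\nabla{\hat{y}}_{t,i}}}}.}}} & (9)\end{matrix}$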
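The update rate can be sketched in Python under the following assumption: the summed loss is expanded to second order around the previous prediction, giving the terms g_i·α·∇ŷ_i + 0.5·h_i·α²·∇ŷ_i² for each sample, and the derivative with respect to α is set to zero. The sign of the result follows from that expansion and is part of the assumption of this sketch. The function name update_rate is hypothetical.

```python
def update_rate(g, h, delta):
    """Second-order Taylor estimate of the update rate.

    g, h:  first and second derivatives of the loss per sample
    delta: predictions of the regression model per sample
    Minimizes sum_i (g_i * a * d_i + 0.5 * h_i * a**2 * d_i**2) over a.
    """
    numerator = sum(g_i * d_i for g_i, d_i in zip(g, delta))
    denominator = sum(h_i * d_i * d_i for h_i, d_i in zip(h, delta))
    return -numerator / denominator
```

As a check under a squared loss of the form 0.5*(ŷ − y)², with targets 1, 2, 3, previous predictions of 0 and regression predictions 0.5, 1.0, 1.5, the gradients are −1, −2, −3 and the sketch returns an update rate of 2, which moves the predictions exactly onto the targets.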

Example Calculations: Incrementally Update Aggregation Function(s)

Incremental updating is highly beneficial to efficient processing according to various implementations, both in terms of time and memory. Examples are provided above for aggregations where the incremental update is comparatively trivial, such as COUNT, AVG or SUM.

By contrast, this section is dedicated to introducing an approach for incrementally updating aggregations for which incremental updates are non-trivial. In order to keep the approach scalable, an algorithm was developed according to various implementations that requires little time and memory to achieve the above-noted goal.

In order to accomplish this, two arrays of matches are used, denoted matches and ptr. As noted herein, a match connects samples in two different tables, one of which is aggregated and then joined to the other. For simplicity, the former table is referred to as the input table and the latter as the output table.

The index of the sample in the output table associated with a match is referred to as ix_output, and it is assumed that there is a function denoted get_ix_output( . . . ) which returns the ix_output of a match. Moreover, the function get_value( . . . ) returns the underlying value to be aggregated (which can be a value in the input table or a combination of several values in the input and output table).

The array matches is sorted by the ix_output and the value to be aggregated, in that order. A single element in ptr is denoted p.

The function that returns the element in matches from p is denoted get_match( . . . ). Moreover, the elements of ptr know the locations of their corresponding entries in matches. A solution that fulfills the latter requirement is to have ptr be a set of pointers to matches, but other approaches are possible, such as having the elements in ptr contain pointers or indices that signify the corresponding element in matches.

It can also be assumed that the operators ==, >, <, <= and >= are defined on the matches. These operators compare the position in the array matches rather than the underlying values. Moreover, the operators ++ and −− move the match to the next/previous position in matches. One way to implement this is to use iterators, but other ways, such as using operator overloading or custom methods, are possible.

In addition to matches and ptr, the following data structures are required for at least some aggregations: an array denoted counts, which counts the active matches for each ix_output; an array denoted sums, which stores the sums of the currently aggregated values; an array denoted values, which stores the current value of the aggregation for each ix_output; and an array denoted current_matches, which dynamically stores matches that are of significance for incrementally updating certain aggregations.

Also, it is assumed that there is a NULL value. This NULL can either be NaN or some imputation value, such as 0.0, chosen by the user. Additionally, this approach also requires a data structure denoted unique_indices, which can store all indices that have been altered by an aggregation without containing duplicate entries. It is assumed that the data structure (e.g., array, hash map, etc.) has a method denoted .insert( . . . ), which adds an index if it is not already contained in the data structure. Options for applying this approach are described in the subsequent sections.

Moreover, the matches need to contain a Boolean variable activated, which signifies whether the match is currently activated. Finally, some aggregations require the functions find_next_smaller and find_next_greater, which return the next smaller/greater activated match and can be efficiently implemented as shown in FIG. 7.

The algorithm for incrementally updating the COUNT aggregation, which is one useful aggregation to update incrementally, is illustrated in pseudocode in FIG. 8.

The SUM aggregation function is shown in FIG. 9.

The AVG aggregation is more complicated than the SUM or COUNT aggregations, because the AVG aggregation must maintain both the sums and the counts, as shown in FIG. 10.
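The bookkeeping for COUNT, SUM and AVG can be sketched in Python as follows. This is not the pseudocode of FIGS. 8 through 10; it is a minimal illustration that reuses the names counts, sums and values from the description, uses a Python set as a stand-in for unique_indices, and assumes a NULL value of 0.0 by default. The class name IncrementalAvg is hypothetical.

```python
class IncrementalAvg:
    """Maintains COUNT, SUM and AVG per ix_output under activation changes."""

    def __init__(self, n_outputs, null=0.0):
        self.null = null                  # NULL value chosen by the user
        self.counts = [0] * n_outputs     # active matches per ix_output
        self.sums = [0.0] * n_outputs     # sum of active values per ix_output
        self.values = [null] * n_outputs  # current AVG per ix_output
        self.unique_indices = set()       # stand-in: altered ix_output values

    def _refresh(self, ix_output):
        count = self.counts[ix_output]
        self.values[ix_output] = self.sums[ix_output] / count if count else self.null
        self.unique_indices.add(ix_output)

    def activate(self, ix_output, value):
        self.counts[ix_output] += 1
        self.sums[ix_output] += value
        self._refresh(ix_output)

    def deactivate(self, ix_output, value):
        self.counts[ix_output] -= 1
        self.sums[ix_output] -= value
        self._refresh(ix_output)
```

Each call touches only the ix_output of the match whose activation status changed, so the cost per changed match is constant, and the set of altered indices tells the caller which samples need their contribution to the optimization criterion updated.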

The VAR (or VARIANCE) aggregation keeps track of additional sufficient statistics, as shown in FIG. 11.

The STDDEV aggregation builds on VAR, as shown in FIG. 12.

The third moment, or SKEWNESS, can be updated incrementally, as shown in the algorithm depicted in FIG. 13.

FIG. 14 shows the MAX aggregation, which is the first aggregation that requires the matches to be sorted. This algorithm can be used for incrementally updating the MAX aggregation.
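One possible way to update MAX incrementally on a sorted array of matches is sketched below in Python. It is not the algorithm of FIG. 14, and the class name IncrementalMax and its layout are hypothetical. Matches are stored as (ix_output, value) pairs sorted by ix_output and then by value, positions in that array are compared rather than values, and deactivating the current maximum walks downward to the next smaller activated match.

```python
class IncrementalMax:
    def __init__(self, matches, n_outputs, null=0.0):
        # Sorted by ix_output and then by value, as described in the text.
        self.matches = sorted(matches)
        self.activated = [False] * len(self.matches)
        # Position of the current maximum per ix_output (None if no active match).
        self.current_matches = [None] * n_outputs
        self.null = null

    def value(self, ix_output):
        pos = self.current_matches[ix_output]
        return self.null if pos is None else self.matches[pos][1]

    def activate(self, pos):
        self.activated[pos] = True
        ix_output = self.matches[pos][0]
        current = self.current_matches[ix_output]
        # A later position within the same ix_output means a value at least as large.
        if current is None or pos > current:
            self.current_matches[ix_output] = pos

    def find_next_smaller(self, pos):
        ix_output = self.matches[pos][0]
        p = pos - 1
        while p >= 0 and self.matches[p][0] == ix_output:
            if self.activated[p]:
                return p
            p -= 1
        return None

    def deactivate(self, pos):
        self.activated[pos] = False
        ix_output = self.matches[pos][0]
        if self.current_matches[ix_output] == pos:
            self.current_matches[ix_output] = self.find_next_smaller(pos)
```

Deactivating a match that is not the current maximum costs constant time in this sketch; only the deactivation of the maximum itself triggers a search, and that search moves in one direction only.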

After developing the algorithm for incrementally updating the MAX aggregation, the MIN aggregation can be incrementally updated using a slightly modified version, as depicted in FIG. 15.

The MEDIAN aggregation follows a similar approach to the MIN/MAX aggregation, but with some additional challenges. FIGS. 16 and 17 illustrate algorithms for calculating the MEDIAN aggregation. FIG. 16 shows the algorithm for activating the MEDIAN aggregation, and FIG. 17 shows the algorithm for deactivating the MEDIAN aggregation. In this example, the greater value is stored in current_matches, which is an arbitrary convention; the smaller value could equally be stored.

The COUNT DISTINCT aggregation can be updated as shown in FIGS. 18 and 19. FIG. 18 shows the algorithm for activating the COUNT DISTINCT aggregation. FIG. 19 shows the algorithm for deactivating the COUNT DISTINCT aggregation.

The above-noted algorithms can have various performance implications. For example, let l be the number of matches (thus the length of the array matches), let n be the number of matches per ix_output, let t be the number of bins returned by bin_function, and let p be the number of ix_output values that need to be updated within one bin returned by bin_function.

For comparison, it is beneficial to first discuss the time complexity of updating aggregations in a non-incremental way. Non-incremental updates mean that every time there is a change to one value going into the aggregation, the aggregation must be recalculated over all values that go into that aggregation. If the time complexity of an aggregation is denoted in n as Oa(n), then the cost of such an operation will be O(Oa(n)tp). Oa(n) is likely to be linear for most aggregations, but not for COUNT DISTINCT and MEDIAN.

Note that t and p are inversely related, but this relationship is unlikely to be linear, meaning that there is a performance penalty to having many bins. Also note that this operation is particularly expensive when the number of matches per ix_output is very high. This can be a considerable scalability problem, because sampling from the population table implies that the number of ix_outputs is going to be constant, but having many matches per ix_output (high n) also implies a higher p. This is the case because, when keeping the number of ix_outputs constant, n and l are linearly related to each other, which means that there must be more matches within one bin, which most likely leads to a higher p. The only scenario where this would not be the case is if all additional matches fall into the same bin, which is unlikely.

Coupling t with l, as most conventional implementations do, will reduce p, but not in proportion to the increase in t. In other words, given a constant number of ix_outputs, the time complexity of this approach in l is O(l) in the best case, but worse than O(l) under a more probable scenario.

Contrast this with the time complexity of incremental updates: it ispossible to see that the time complexity of incrementally updating theCOUNT, SUM, AVG, VAR, STDDEV and SKEWNESS aggregations as describedabove is exactly O(l). It is also possible to see that the timecomplexity of activating a sample in the MIN and MAX aggregation isO(l). Closer analysis reveals that the time complexity of deactivating asample is also O(l). This is the case, because the functions find nextsmaller and find next greater always move in one direction. Therefore,when successively deactivating an array of matches, the maximum numberof steps by these functions is l, regardless of n. However, this doesrequire sorting the array matches once at the very beginning of trainingthe multi-relational decision tree. The time complexity of thatoperation depends on the sorting algorithm, but in many cases is O(l logl).

The time complexity of incrementally updating the MEDIAN aggregation isO(ln) in the worst case and O(l) in the best case. It is difficult toestimate the cost of calculating the MEDIAN globally every time at leastone input has changed, because it would require us to make assumptionson the distribution of the matches and their corresponding criticalvalues. What is more, the incremental update will become less expensivewhen many matches are already aggregated (because it will be easier tofind the next split), whereas the global approach will become moreexpensive. That means as the tree grows deeper, it is more likely thatmany matches are activated and will be unaffected by the most recentsplit. In such a scenario, incremental updates will be more efficient.

The time complexity of incrementally updating the COUNT DISTINCTaggregation will depend on the number of unique values per ix output:the more values, the cheaper it gets to update COUNT DISTINCT.Therefore, the worst case is that there is only one unique value per ixoutput, in which case the time complexity is O(ln) in addition to thecost for sorting once in the very beginning. However, this worst case isunrealistic in practice. What is more, like the MEDIAN aggregation, theCOUNT DISTINCT aggregation will become cheaper as more values areaggregated.

As shown in the examples herein, the incremental updating algorithms discussed in this section can yield considerable reductions in time complexity when compared to the conventional approach of having to recalculate the aggregation whenever at least one single match has changed within a bin. This is particularly true for the "classical" aggregation functions COUNT, SUM, AVG, MIN and MAX, as well as VAR, STDDEV and SKEWNESS. The performance gain of incrementally updating COUNT DISTINCT and MEDIAN will depend on the exact implementations of the aggregation functions in the case of non-incremental updates. Moreover, the time complexity of incrementally updating the aggregations is unrelated to t. This is another advantage of the various implementations disclosed herein, as it enables trying more bins without any considerable performance costs. Trying more bins allows for finding the ideal condition more accurately.

Example Calculation: Efficient Propagation of Updated Values

In the above section, reference is made to a data structure called unique_indices. This data structure keeps all of the ix_output values that have been altered when incrementally updating the aggregations. This is necessary for incrementally updating the optimization criterion, because incrementally updating this criterion is not possible without knowing which samples need updating.

The purpose of unique_indices is to keep the indices uniquely, meaning that no ix_output should be contained twice. In various implementations, it is assumed that the data structure has a method denoted .insert( . . . ), which adds an index if it is not already contained in the data structure. As described above, any update of any aggregation will call .insert( . . . ). It is therefore beneficial for performance that this data structure is efficiently implemented. In particular, the process of insertion must be efficient.

Several data structures can be used to facilitate this efficiency. Let n be the number of elements already inserted into the structure: (1) A binary search tree: In the worst case, the cost of insertion can be O(n), but on average it will be O(log n); (2) A red-black tree: In the worst case, the cost of insertion is O(log n), however, there is overhead for potentially having to restructure the tree after an insertion; (3) A hash map: In theory, the cost of insertion should be independent of the number of elements already inserted, but in practice, collisions can occur.

However, approaches described herein have an advantage, because the range of the ix_output to insert is known in advance. In this example, let k be the number of elements in the output table and thus the number of different ix_output. Then, use the following data structure: Keep an array of Booleans of size k. Each element in that array represents one ix_output and signifies whether that element has already been inserted. Keep a variable-size container (such as std::vector in C++) that contains the indices. When inserting into unique_indices, check the array of Booleans to see whether the element has already been inserted, and if not, append it to the variable-size container. If the variable-size container has been allocated sufficient capacity, then the time complexity of the insertion operation is guaranteed to be O(1). Moreover, the resulting indices will be stored contiguously in memory, which means that iteration through these elements is also fast and efficient.
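By way of non-limiting illustration, the unique_indices structure described above can be sketched in Python as follows. The class name UniqueIndices, the attribute names and the clear method are assumptions introduced for this sketch only. A Python list is used in place of a pre-allocated std::vector, so appending is amortized O(1) rather than guaranteed O(1).

```python
class UniqueIndices:
    """Keeps each ix_output at most once, with constant-time insertion."""

    def __init__(self, k):
        # One Boolean per possible ix_output (k = number of rows in the output table).
        self.already_inserted = [False] * k
        # Variable-size container holding the inserted indices in insertion order.
        self.indices = []

    def insert(self, ix_output):
        # Append only if this index has not been recorded before.
        if not self.already_inserted[ix_output]:
            self.already_inserted[ix_output] = True
            self.indices.append(ix_output)

    def clear(self):
        # Reset only the flags that were set, so the cost is O(len(indices)), not O(k).
        for ix_output in self.indices:
            self.already_inserted[ix_output] = False
        self.indices.clear()

    def __iter__(self):
        return iter(self.indices)

    def __len__(self):
        return len(self.indices)


unique_indices = UniqueIndices(k=5)
for ix_output in [3, 1, 3, 3, 0, 1]:
    unique_indices.insert(ix_output)
print(list(unique_indices))  # [3, 1, 0]
```

Because the Boolean array is indexed directly by ix_output, no hashing or tree traversal is needed, which is the advantage the known index range provides.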

Example Calculation: Intermediate Aggregations

As noted herein, it is possible that the path from the peripheral table to the population table contains more than just one aggregation. There can be a set of nested aggregations such as the SUM of a COUNT or the AVG of a SUM. Therefore, it must be possible to update aggregations based not just on the changing of the activation status of a match, but also based on the changed values of another aggregation. In this section, any aggregation that is not at the lowest level of this chain is called an intermediate aggregation. This section describes implementing aggregations as intermediate aggregations. Begin by expressing the SUM aggregation in pseudo-code, as represented in FIG. 20 (showing an algorithm for updating AVG as an intermediate aggregation).
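As a hedged illustration of the idea (not a reproduction of FIG. 20), the following Python sketch shows how an intermediate SUM or AVG might be updated from the old and new value of a lower-level aggregation rather than from individual matches. The class names, the group_to_output mapping and the assumption that the number of lower-level results per output is fixed and non-zero are all hypothetical choices made for this example.

```python
class IntermediateSum:
    """SUM over the results of a lower-level aggregation."""

    def __init__(self, num_outputs):
        self.values = [0.0] * num_outputs

    def update(self, ix_output, old_input, new_input):
        # Only the change in the lower-level result is needed.
        self.values[ix_output] += new_input - old_input


class IntermediateAvg:
    """AVG over the results of a lower-level aggregation."""

    def __init__(self, counts):
        # counts[ix_output]: number of lower-level results per output (assumed fixed, > 0).
        self.counts = counts
        self.values = [0.0] * len(counts)

    def update(self, ix_output, old_input, new_input):
        self.values[ix_output] += (new_input - old_input) / self.counts[ix_output]


# Three lower-level groups; groups 0 and 1 feed output 0, group 2 feeds output 1.
group_to_output = [0, 0, 1]
counts = [0, 0, 0]  # lower-level COUNT per group
sum_of_counts = IntermediateSum(num_outputs=2)
avg_of_counts = IntermediateAvg(counts=[2, 1])


def activate_match(group):
    # Update the lowest-level COUNT, then propagate old and new value upwards.
    old = counts[group]
    counts[group] = old + 1
    ix_output = group_to_output[group]
    sum_of_counts.update(ix_output, old, counts[group])
    avg_of_counts.update(ix_output, old, counts[group])


for group in [0, 0, 1, 2]:
    activate_match(group)

print(sum_of_counts.values)  # [3.0, 1.0]
print(avg_of_counts.values)  # [1.5, 1.0]
```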

Calculating Learnable Weights

When aggregating over columns, the approaches (e.g., algorithms) described herein produce features such as are shown in FIG. 4. These features are very similar to the features a data scientist might write by hand, albeit often more complex. However, they do have a disadvantage, namely that all possible aggregations must be applied to all possible values to be aggregated. This can produce many possible combinations. Therefore, some embodiments aggregate over learnable weights, producing features in the form exemplified in FIG. 21 (in practice, such features would consist of more than just two conditions). The primary benefit in this scenario is that for every possible aggregation function there is only one possible set of values to aggregate per peripheral table. This greatly reduces the search space. The primary disadvantage is the limitation in the possible aggregation functions.

Optimization Criterion

The optimization criterion used for calculating the learnable weights can be any loss function such as squared loss or log loss, described according to the following examples.

Let L( . . . ) be a loss function, yi be the target for sample i and f_(t−1,i) be the prediction generated from all previous predictors in the gradient boosted ensemble. The goal is to find a ∇ft,i that minimizes the loss function over all samples:

$\min_{\nabla f_{t,i}} \sum_{i} L\left(f_{t-1,i} + \nabla f_{t,i};\, y_i\right) \qquad (10)$

For training a new predictor ∇ft,i, follow the xgboost approach of using the second-order Taylor approximation:

$\begin{matrix}{{L\left( {{f_{{t - 1},i} + {\nabla f_{t,i}}};y_{i}} \right)} \approx {{L\left( {f_{{t - 1},i};y_{i}} \right)} + {\frac{\partial{L\left( {f_{{t - 1},i};y_{i}} \right)}}{\partial f_{{t - 1},i}}{\nabla f_{t,i}}} + {\frac{1}{2}\frac{\partial{\partial{L\left( {f_{{t - 1},i};y_{i}} \right)}}}{{\partial f_{{t - 1},i}}{\partial f_{{t - 1},i}}}{\nabla f_{t,i}^{2}}}}} & (11) \\{\mspace{79mu} {{{Let}\mspace{14mu} g_{i}}:={{\frac{\partial{L\left( {f_{{t - 1},{i;}}y_{i}} \right)}}{\partial f_{{t - 1},i}}\mspace{14mu} {and}\mspace{14mu} h_{i}}:={\frac{\partial{\partial{L\left( {f_{{t - 1},i};y_{i}} \right)}}}{{\partial f_{{t - 1},i}}{\partial f_{{t - 1},i}}}.}}}} & \; \\{\mspace{79mu} {{Deriving}\mspace{14mu} {for}\mspace{14mu} {\nabla f_{t,i}}\mspace{14mu} {yields}\text{:}}} & \; \\{\mspace{79mu} {\frac{\partial{L\left( {{f_{{t - 1},i} + {\nabla f_{t,i}}};y_{i}} \right)}}{\partial{\nabla f_{t,i}}} \approx {g_{i} + {h_{i}{\nabla f_{t,i}}}}}} & (12)\end{matrix}$

However, because of the relational nature of the dataset, ∇ft,i cannot be freely optimized. Instead, it comprises a set of learnable weights w0, w1, w2, . . . , wN or w that are aggregated through a series of aggregations. These aggregations need to be linear transformations of the original weights. Examples of aggregations that are linear transformations include SUM and AVERAGE. Based on the nature of these aggregations, it is possible to calculate a set of impact factors 1, η1,i, η2,i, . . . , ηN,i or ηi, which are linear multipliers measuring the impact of each of the weights on ∇ft,i, such that ∇ft,i can be written as follows:

$\begin{matrix}{{\nabla f_{t,i}} = {{\eta_{i}^{T}w} = {w_{0} + {\sum\limits_{n = 1}^{N}{\eta_{n,i}w_{n}}}}}} & (13)\end{matrix}$

The following is an example to illustrate how ηi is formed. Consider the feature described in FIG. 4. As seen, this feature consists of two weights w1 and w2, as well as the interaction term w0. Suppose that there are five matches between sample i in the population table and the peripheral table. Further suppose that for these five matches, condition1 is true three times and condition2 is true twice. Assuming that the aggregation in use is a SUM aggregation, ∇ft,i can be written as follows:

$\nabla f_{t,i} = w_0 + 3 w_1 + 2 w_2 \qquad (14)$

This implies that η1,i=3 and η2,i=2. Since w0, w1, w2, . . . , wN is the set of learnable weights, it is possible to optimize for those weights. Combining equation (12) and equation (13) yields the following equations:
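The example of equation (14) can be reproduced with a short Python sketch. The function names, the sample amounts and the two threshold conditions are hypothetical and chosen only so that condition1 holds three times and condition2 twice. The AVG variant assumes that each impact factor is the share of matches meeting the condition.

```python
def impact_factors_sum(matches, conditions):
    """eta_i = [1, eta_1, ..., eta_N] for one sample under a SUM aggregation."""
    # The leading 1 is the multiplier of w0; each further entry counts the
    # matches for which the corresponding condition is true.
    return [1.0] + [float(sum(1 for m in matches if cond(m))) for cond in conditions]


def impact_factors_avg(matches, conditions):
    """Same as above for AVG: counts are divided by the number of matches."""
    total = len(matches)
    return [1.0] + [sum(1 for m in matches if cond(m)) / total for cond in conditions]


def feature_value(eta, w):
    # Equation (13): the feature value is the inner product of eta_i and w.
    return sum(e * w_n for e, w_n in zip(eta, w))


# Five hypothetical matches (e.g., transaction amounts) for sample i.
matches = [150, 200, 50, 120, 80]
conditions = [lambda amount: amount > 100,    # condition1: true three times
              lambda amount: amount <= 100]   # condition2: true twice

eta_i = impact_factors_sum(matches, conditions)
print(eta_i)                                   # [1.0, 3.0, 2.0]
print(feature_value(eta_i, [0.5, 1.0, 2.0]))   # 0.5 + 3*1.0 + 2*2.0 = 7.5
```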

$\begin{matrix}{{\frac{\partial{L\left( {{f_{{t - 1},i} + {\nabla f_{t,i}}};y_{i}} \right)}}{\partial w_{0}} = {{\frac{\partial{L\left( {{f_{{t - 1},i} + {\nabla f_{t,i}}};y_{i}} \right)}}{\partial{\nabla f_{t,i}}}\frac{\partial{\nabla f_{t,i}}}{\partial w_{0}}} \approx \left( {g_{i} + {h_{i}\eta_{i}^{T}w}} \right)}},} & (15) \\{\frac{\partial{L\left( {{f_{{t - 1},i} + {\nabla f_{t,i}}};y_{i}} \right)}}{\partial w_{n}} = {{\frac{\partial{L\left( {{f_{{t - 1},i} + {\nabla f_{t,i}}};y_{i}} \right)}}{\partial{\nabla f_{t,i}}}\frac{\partial{\nabla f_{t,i}}}{\partial w_{n}}} \approx {\left( {g_{i} + {h_{i}\eta_{i}^{T}w}} \right){\eta_{n,i}.}}}} & (16)\end{matrix}$

When the set of weights (w0, w1, w2, . . . , wN) is optimal, the sum of all gradients is zero:

$\begin{matrix}{{\sum\limits_{i}{\left( {g_{i} + {h_{i}\eta_{i}^{T}w}} \right)\eta_{i}}} = 0.} & (17)\end{matrix}$

Equation (17) can now be rearranged as follows:

$\begin{matrix}{{\sum\limits_{i}{h_{i}\eta_{i}\eta_{i}^{T}w}} = {- {\sum\limits_{i}{g_{i}{\eta_{i}.}}}}} & (18)\end{matrix}$

Equation (18) is an (N+1)×(N+1) system of linear equations that can be solved using standard algorithms, for example, LU decomposition, Gaussian elimination or Cramer's rule.
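A minimal sketch of assembling and solving equation (18) is given below, using plain Python and Gaussian elimination with partial pivoting. The function names and the toy data are assumptions for this example. The data assume squared loss with a previous prediction of zero, so that g_i = −y_i and h_i = 1, and targets generated as y = 1 + 2·η1, so the recovered weights should be w0 = 1 and w1 = 2.

```python
def solve_linear_system(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [a[r][:] + [b[r]] for r in range(n)]  # augmented copy [A | b]
    for col in range(n):
        # Choose the row with the largest pivot to limit rounding error.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("system is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def optimal_weights(g, h, eta):
    """Solve sum_i h_i eta_i eta_i^T w = -sum_i g_i eta_i (equation (18))."""
    dim = len(eta[0])
    lhs = [[0.0] * dim for _ in range(dim)]
    rhs = [0.0] * dim
    for g_i, h_i, eta_i in zip(g, h, eta):
        for a in range(dim):
            rhs[a] -= g_i * eta_i[a]
            for b in range(dim):
                lhs[a][b] += h_i * eta_i[a] * eta_i[b]
    return solve_linear_system(lhs, rhs)


# Each eta_i starts with 1 (the multiplier of w0), followed by eta_1.
eta = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
targets = [1.0, 3.0, 5.0, 7.0]
g = [-y for y in targets]   # squared loss, previous prediction 0
h = [1.0] * len(targets)

w = optimal_weights(g, h, eta)
print([round(w_n, 6) for w_n in w])  # [1.0, 2.0]
```

Because N is typically small, a dense direct solver of this kind is inexpensive compared with building the sums over all samples.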

Fixed Weights

For reasons of efficiency, it is often desirable not to optimize all weights at the same time. For example, let there be a set of fixed weights denoted as wf and let μ be the set of associated impact factors. Equation (17) can then be rewritten as follows:

$\begin{matrix}{{\sum\limits_{i}{\left( {g_{i} + {h_{i}\eta_{i}^{T}w} + {h_{i}\mu_{i}^{T}w_{f}}} \right)\eta_{i}}} = 0.} & (19)\end{matrix}$

To retrieve a new system of linear equations, it is possible to rearrange equation (19) as follows:

$\begin{matrix}{{\sum\limits_{i}{h_{i}\eta_{i}\eta_{i}^{T}w}} = {- {\sum\limits_{i}{\left( {g_{i} + {h_{i}\mu_{i}^{T}w_{f}}} \right){\eta_{i}.}}}}} & (20)\end{matrix}$

Regularization

One of the advantages of xgboost is that L2 regularization can be seamlessly integrated into the concept. With a few modifications, the same is true for this approach. For example, let Ri be an (N+1)×(N+1) diagonal matrix that contains the values 1, η1,i, η2,i, . . . , ηN,i or ηi on its diagonal. The L2 regularization function R(ηi, w) is then defined as follows:

$\begin{matrix}{{R\left( {\eta_{i},w} \right)}:={{\frac{1}{2}w^{T}R_{i}w} = {\frac{1}{2}{\sum\limits_{n}{\eta_{n,i}w_{n}^{2}}}}}} & (21)\end{matrix}$

This can also lead to a modification of the optimization criterion. For example, let λ be the L2 regularization coefficient. The goal is to find a ∇ft,i that minimizes the loss function plus the regularization term over all samples:

$\begin{matrix}{\min\limits_{\nabla\; f_{t,i}}{\sum\limits_{i}\left( {{L\left( {{f_{{t - 1},i} + {\nabla f_{t,i}}};y_{i}} \right)} + {\lambda \; {R\left( {\eta_{i},w} \right)}}} \right)}} & (22)\end{matrix}$

Then, equation (16) can be modified as follows:

$\begin{matrix}{{\frac{\partial{L\left( {{f_{{t - 1},i} + {\nabla f_{t,i}}};y_{i}} \right)}}{\partial w_{n}} + {\lambda \frac{\partial{R\left( {\eta_{i},w} \right)}}{\partial w_{n}}}} \approx {{\left( {g_{i} + {h_{i}\eta_{i}^{T}w}} \right)\eta_{n,i}} + {{\lambda\eta}_{n,i}w_{n}}}} & (23)\end{matrix}$

In the optimum scenario, the sum of all gradients must be zero for all weights (w0, w1, w2, . . . , wN):

$\begin{matrix}{{{\sum\limits_{i}{\left( {g_{i} + {h_{i}\eta_{i}^{T}w}} \right)\eta_{i}}} + {\lambda {\sum\limits_{i}{R_{i}w}}}} = 0.} & (24)\end{matrix}$

This leads to a new system of linear equations:

$\begin{matrix}{{\sum\limits_{i}{\left( {{h_{i}\eta_{i}\eta_{i}^{T}} + {\lambda \; R_{i}}} \right)w}} = {- {\sum\limits_{i}{g_{i}{\eta_{i}.}}}}} & (25)\end{matrix}$

Regularization and Fixed Weights

When there are fixed weights and regularization, it is possible to combine equation (20) and equation (25) to produce a system of linear equations that appears as follows:

$\begin{matrix}{{\sum\limits_{i}{\left( {{h_{i}\eta_{i}\eta_{i}^{T}} + {\lambda \; R_{i}}} \right)w}} = {- {\sum\limits_{i}{\left( {g_{i} + {h_{i}\mu_{i}^{T}w_{f}}} \right){\eta_{i}.}}}}} & (26)\end{matrix}$

FIG. 22 shows a flow chart illustrating specific implementations of the method of the invention. Reference numeral 231 indicates a plurality of sample data sets i which jointly form the sample data. From this sample data a feature shall be extracted that is suitable to predict target values associated with the sample data. The sample data sets i may be stored in one or more peripheral tables or alternatively in the population table or partly in the population table and partly in one or more peripheral tables.

Reference numeral 232 indicates a process to calculate feature values ∇ft,i from the sample data sets i by applying an aggregation function on the sample data sets i. An aggregation function may be any mathematical function that calculates one value from a sample data set i, for example by summing all values. More complex functions may also be used to calculate the feature values ∇ft,i. For example, a feature may comprise two or more aggregation functions; the first aggregation function (e.g., a MEDIAN function) aggregates several portions of a sample data set to output several first aggregation results; then the second aggregation function (e.g., a SUM function) aggregates the several first aggregation results to output one (second) aggregation result for this sample data set.

The number of calculated aggregation results/feature values ∇ft,i may correspond to the number I of sample data sets. These feature values ∇ft,i are transferred into the population table 233. The population table 233 also includes one target value Ti for each sample data set i, i.e., for each calculated feature value ∇ft,i.

Reference numeral 234 indicates a process of training a regression model with the feature values ∇ft,i to predict the target values Ti. For example, a linear regression may be used. The regression model uses only values in the population table 233 but not values of the peripheral tables in which the sample data sets i may be stored.

In process 235 an optimization criterion function such as R-squared or a loss function is used to calculate a quality with which the feature values ∇ft,i predict, or allow predicting, the target values Ti. For R-squared, a large value indicates a high quality, whereas for a loss function a small value indicates a high quality.

The quality depends on how the feature values ∇ft,i are calculated. With a given aggregation function and given sample data, the feature values ∇ft,i depend on how the sample data is aggregated, i.e., which part of the sample data is aggregated or which weight is assigned to the sample data when not the sample data itself but the weights are aggregated. One or more conditions are used which define how the sample data is aggregated. The choice of conditions is vital for the resulting quality. However, in principle a huge number of different conditions come into question, which raises the question of how to efficiently calculate qualities for a large number of different conditions. This is achieved in implementations of the method of the invention, one such implementation being described in more detail with respect to the flowchart shown in FIG. 23.

In process 241 of FIG. 23, the feature of FIG. 22 is adjusted by varying a condition applied to the aggregation function used to calculate the feature values ∇ft,i. Exemplary conditions are indicated in other Figures with the WHERE command or the WHEN command. The condition may be such that only sample data that meets the condition (or alternatively sample data that does not meet the condition) will be aggregated with the aggregation function. The condition may refer to the sample data itself or to additional data associated with the sample data. The feature values ∇ft,i calculated in process 241 may vary from the previously calculated feature values (of e.g. FIG. 22) and may thus be referred to as ∇ft,i^(new).

Performing processes 232 and 235 of FIG. 22 again each time a new condition is tested would require much calculation power. This effort can be partly reduced if previous calculation results are used and modified to account for differences caused by the present condition. Process 242 determines which sample data sets i are affected by the varied condition. For example, the sample data may include a multitude of sample data sets i and only some of these sample data sets i include data for which the present condition changes whether or not the data shall be aggregated.

Process 243 calculates updated feature values ∇ft,i^(new) only for the sample data sets affected by the varied condition. In other words, previously calculated aggregation results that do not change due to the change in the condition of process 241 are not recalculated.

Process 244 then calculates the quality for the present case, i.e., for the presently applied condition. This calculation does not use the formula used in process 235 of FIG. 22 to initially calculate the quality. Instead, the quality is now calculated using the updated feature values ∇ft,i^(new) of process 243 and previous feature values ∇ft,i calculated before the presently applied condition was used. An example of such a calculation is described later in more detail.

Process 245 repeats the above processes 241 to 244 for different conditions. In particular, each time the processes 241 to 244 are repeated, the previously applied condition may be incrementally changed. For example, an increment may be a step x by which a numerical value y of a condition, such as a threshold, is changed each time. The value y may thus be changed to y+x, then to y+2x and so forth.

After a respective quality has been determined for a plurality of different conditions, process 246 determines with which condition the best quality was achieved. Depending on the quality or optimization criterion, the best quality may either be the highest or lowest value.

Process 247 uses the (first) condition determined in process 246 and adds a further (second) condition. The above-described processes are repeated for the further second condition to determine the best-possible quality and then use this second condition that yields the best quality. When the second condition is varied, the previously determined first condition stays fixed and is not changed. This may be described as greedily adding conditions. Sample data that the first condition classifies as not to be aggregated may be split by the second condition into two groups, one of which shall be aggregated and the other shall not be aggregated. This is also described as changing the activation status for this part of the sample data. Further different conditions may greedily be added to both the TRUE and FALSE outcomes of the first condition. In this way, a condition tree is formed. A feature that may now be output to be used in other applications may comprise one or more aggregation functions and this condition tree, i.e., a plurality of conditions together with the indication whether or not sample data shall be aggregated if the condition(s) are met or not met. In implementations in which not the sample data itself but weights assigned to the sample data are aggregated, a feature may comprise one or more aggregation functions, the conditions and the values of the weights. In a machine-learning application, the feature determined as described above is used to calculate feature values from the present sample data.

FIG. 24 shows an example of the process 243 of FIG. 23 in more detail. An incremental change of a condition may not change the aggregation results for all sample data sets i but just for some sample data sets. For these sample data sets the aggregation results need to be updated when the condition is varied. However, it is not always necessary to redo the whole aggregation calculation (as done in process 232 of FIG. 22). Instead, process 243.1 may first identify which sample values within one sample data set are affected by the varied condition, e.g., which sample values M were previously aggregated but shall not be aggregated anymore according to the varied condition. Process 243.2 then calculates an updated feature value ∇ft,i^(new) by using the previously calculated feature value ∇ft,i^(old) for this sample data set and adjusting it to account for the contribution of the sample values M. In the case of a SUM function as an aggregation function, this adjustment can be easily done by adding or subtracting the values of the sample values M from the previously calculated feature value ∇ft,i^(old). In this way, calculating an updated feature value ∇ft,i^(new) may be significantly accelerated, in particular if one data set includes a large number of values, such as 10,000 values, and the incremental change in the condition only affects a few values, e.g., fewer than 10 values.
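The following Python sketch illustrates processes 243.1 and 243.2 for a SUM aggregation under a threshold condition of the form "value > threshold". It is an assumption-laden example rather than the procedure of FIG. 24 itself: the function names are hypothetical, the values of one sample data set are assumed to be pre-sorted, and the threshold is assumed to only increase between steps.

```python
from bisect import bisect_right


def full_sum(sorted_values, threshold):
    # Non-incremental reference: aggregate every value that meets the condition.
    return sum(v for v in sorted_values if v > threshold)


def updated_sum(old_feature_value, sorted_values, old_threshold, new_threshold):
    """Adjust SUM(value WHERE value > threshold) after the threshold is raised."""
    # Process 243.1: find the sample values M that were aggregated before but
    # no longer meet the condition, i.e. old_threshold < value <= new_threshold.
    first = bisect_right(sorted_values, old_threshold)
    last = bisect_right(sorted_values, new_threshold)
    affected = sorted_values[first:last]
    # Process 243.2: subtract their contribution from the previous result.
    return old_feature_value - sum(affected), affected


values = [1, 2, 5, 7, 9, 12]            # one sample data set, sorted once
feature_value = full_sum(values, 0)      # initial aggregation: 36

feature_value, m = updated_sum(feature_value, values, 0, 2)
print(feature_value, m)                  # 33 [1, 2]

feature_value, m = updated_sum(feature_value, values, 2, 5)
print(feature_value, m)                  # 28 [5]
```

Each update touches only the deactivated values, so its cost depends on the size of M rather than on the size of the sample data set.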

FIG. 25 illustrates an example of process 244 regarding the calculation of a quality value when a condition is varied. This example is based on R-squared as the optimization criterion function. Loss functions or other optimization criterion functions may be used instead. The calculation of a quality according to the formula of R-squared shown in process 235 of FIG. 22 puts rather high demands on computation power. Hence, this calculation is not performed each time a condition is varied. Instead, the optimization criterion update formula shown in process 244 of FIG. 25 may be used. It uses the previously calculated feature value ∇f_(t,i)^(old) and describes how a change in a specific feature value, e.g. for i=2 a change from ∇f_(t,2)^(old) to ∇f_(t,2)^(new), affects the resulting quality measure R-squared. The calculation shown in FIG. 25 of updating the R-squared value may be particularly advantageous if not all but rather a small part of all feature values ∇f_(t,i) are affected by a change in the condition. This is often the case when incremental changes are used. The procedure of FIG. 25 thus offers a way to efficiently determine qualities for a large number of different conditions that slightly vary from each other.
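The update formula of FIG. 25 is not reproduced in the text, so the sketch below shows only one possible way such an update might be realized. It assumes that the quality is the R-squared of a linear regression on a single feature, which equals the squared Pearson correlation between feature values and targets, and it keeps running sums so that changing one feature value costs a constant number of operations. The class name and its methods are hypothetical, and running sums of this kind can lose precision for large or badly scaled values.

```python
class RSquaredTracker:
    """Maintains R-squared between feature values and targets via running sums."""

    def __init__(self, features, targets):
        self.features = list(features)
        self.targets = list(targets)
        self.n = len(self.targets)
        # Sums over the targets never change while conditions are varied.
        self.sum_y = sum(self.targets)
        self.sum_yy = sum(y * y for y in self.targets)
        # Sums involving the feature values are updated incrementally.
        self.sum_f = sum(self.features)
        self.sum_ff = sum(f * f for f in self.features)
        self.sum_fy = sum(f * y for f, y in zip(self.features, self.targets))

    def update(self, i, new_value):
        # Replace the old feature value of sample i by the new one.
        old_value = self.features[i]
        self.sum_f += new_value - old_value
        self.sum_ff += new_value * new_value - old_value * old_value
        self.sum_fy += (new_value - old_value) * self.targets[i]
        self.features[i] = new_value

    def r_squared(self):
        cov = self.n * self.sum_fy - self.sum_f * self.sum_y
        var_f = self.n * self.sum_ff - self.sum_f ** 2
        var_y = self.n * self.sum_yy - self.sum_y ** 2
        if var_f <= 0 or var_y <= 0:
            return 0.0  # constant feature or constant target
        return cov * cov / (var_f * var_y)


tracker = RSquaredTracker(features=[1, 2, 3, 5], targets=[2, 4, 6, 8])
before = tracker.r_squared()

# A varied condition changes only the feature value of sample i = 3.
tracker.update(3, 4)
after = tracker.r_squared()
print(after)  # 1.0, since the feature values are now exactly half the targets
```

Only the samples collected as affected need to be passed to update, which is why the cost of evaluating a new condition scales with the number of changed feature values rather than with the number of samples.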

It is understood that the update rate calculation can be applied in any manner described herein.

Applications

Various approaches for analyzing statistical models and relational data models can be applied to the implementations described herein. For example, datasets can be refined by techniques such as gradient boosting, random forests, AdaBoost and neural networks. In various implementations, statistical predictive models can develop and adjust underlying algorithms, equations, connections, sub-models, etc., therein by processing multiple inputs to calculate outputs and compare outputs to predetermined or expected values.

Statistical predictive models can take the form of an artificial neural network (ANN), and more specifically can include one or more sub-classifications of ANN architectures, whether currently-known or later developed. In one example, statistical predictive models can take the form of a multilayer perceptron (MLP) neural network. MLP neural networks may be distinguished from other neural networks, e.g., by mapping sets of input data onto corresponding sets of outputs by way of a directed graph. MLP neural networks can rely upon automatic supervised learning, e.g., through one or more backpropagation processes described herein. MLP neural networks may be particularly suitable for sets of data which may not be linearly separable by conventional mathematical techniques.

Alternative Embodiments and Implementations

As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module,” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.

Any combination of one or more computer readable medium(s) may be used. A computer readable storage medium may be, for example, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing. Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++, C, Go, Rust, D or Scala or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).

Aspects of the present disclosure are described above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks. The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowcharts and block diagrams in the Figures illustrate the layout, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

As used herein, the term “configured,” “configured to” and/or “configured for” can refer to specific-purpose patterns of the component so described. For example, a system or device configured to perform a function can include a computer system or computing device programmed or otherwise modified to perform that specific function. In other cases, program code stored on a computer-readable medium (e.g., storage medium), can be configured to cause at least one computing device to perform functions when that program code is executed on that computing device. In these cases, the arrangement of the program code triggers specific functions in the computing device upon execution. In other examples, a device configured to interact with and/or act upon other components can be specifically shaped and/or designed to effectively interact with and/or act upon those components. In some such circumstances, the device is configured to interact with another component because at least a portion of its shape complements at least a portion of the shape of that other component. In some circumstances, at least a portion of the device is sized to interact with at least a portion of that other component. The physical relationship (e.g., complementary, size-coincident, etc.) between the device and the other component can aid in performing a function, for example, displacement of one or more of the device or other component, engagement of one or more of the device or other component, etc.

The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

We claim:
 1. A computer-implemented method of generating a feature for amachine-learning application, the method comprising: a) identifying apreliminary feature comprising at least one aggregation function thataggregates relational sample data or learnable weights applied to thesample data; b) calculating aggregation results with the aggregationfunction; c) determining, based upon an optimization criterion formula,a quality with which the aggregation results relate to target values; d)adjusting the preliminary feature by incrementally changing a conditionapplied to the aggregation function; e) calculating aggregation resultsof the adjusted preliminary feature by adjusting the aggregation resultsfrom a previous preliminary feature; f) determining a quality with whichthe aggregation results of the adjusted preliminary feature relate tothe target values; and g) repeating processes (d) through (f) for aplurality of incremental changes, and selecting a feature that yieldsthe best quality.
 2. The computer-implemented method of claim 1, wherein the condition affects which sample data is to be aggregated with the aggregation function.
 3. The computer-implemented method of claim 1, wherein process (g) includes repeating processes (d) through (f) at least twenty times, and each time changing the applied condition such that the aggregation function aggregates a larger share of the sample data or each time a smaller share of the sample data.
 4. The computer-implemented method of claim 1, wherein the condition splits the sample data into different groups associated with different learnable weights.
 5. The computer-implemented method of claim 1, wherein process (e) comprises: incrementally adjusting the aggregation results from the previous preliminary feature.
 6. The computer-implemented method of claim 1, further comprising: (h) outputting the adjusted feature that yields the best quality for use in the machine-learning application.
 7. The computer-implemented method of claim 1, further comprising incrementally repeating processes (a) through (g) for a plurality of additional conditions, wherein the adjusted feature becomes the preliminary feature during the repeating.
 8. The computer-implemented method of claim 7, wherein calculating the result of the changed aggregation function includes using only a difference between the result from a current aggregation function and the result from a previous aggregation function.
 9. The computer-implemented method of claim 7, wherein outputting the adjusted feature is performed for only one of the adjusted features that has the best quality.
 10. The computer-implemented method of claim 1, wherein quality is defined by the optimization criterion formula.
 11. The computer-implemented method of claim 1, wherein each feature is attributed to a single aggregation function.
 12. The computer-implemented method of claim 1, further comprising: performing processes (a) through (g) using a distinct preliminary feature; calculating a pseudo-residual for the adjusted feature; and training the distinct preliminary feature to predict an error from the adjusted feature.
 13. The computer-implemented method of claim 1, further comprising: identifying which part of the sample data is affected by the presently applied incremental change of the condition, with the help of a match change identification algorithm, wherein process (e) of calculating the aggregation results of the adjusted preliminary feature comprises: adjusting the aggregation result from the previous preliminary feature by calculating how the sample data identified by the match change identification algorithm changes the aggregation results from the previous preliminary feature.
 14. The computer-implemented method of claim 1, wherein the condition defines a threshold for sample data, and only sample data on a specific side of the threshold is to be aggregated with the aggregation function, or wherein the condition defines a threshold for additional data associated with the sample data, and only sample data for which the associated additional data is on a specific side of the threshold is to be aggregated with the aggregation function, or wherein the condition defines a categorical value and only sample data to which the categorical value of the condition is attributed is to be aggregated with the aggregation function.
 15. The computer-implemented method of claim 1, further comprising: after selecting the feature that yields the best quality in process (g): (h) adding a further condition to the aggregation function and repeating processes (d) through (g) to determine the feature with the further condition that yields the best quality; and repeating process (h) to add still further conditions until a stop algorithm determines that no more conditions are to be applied to the aggregation function.
 16. The computer-implemented method of claim 1, further comprising: using an optimization criterion update formula that describes how a change in the aggregation result of the aggregation function changes a previously calculated quality calculated with the optimization criterion formula, wherein process (f) of determining the quality for adjusted preliminary features comprises: determining with the optimization criterion update formula how the change in the aggregation result affects the quality, without using the optimization criterion formula for calculating the quality for adjusted preliminary features.
 17. The computer-implemented method of claim 1, wherein the relational sample data comprises sample data sets in one or more peripheral tables, wherein the target values are included in a population table, wherein calculated aggregation results are inserted into the population table, wherein the optimization criterion formula uses values from the population table but not from any peripheral table.
 18. The computer-implemented method of claim 1, wherein the preliminary feature comprises at least a first and second aggregation functions; the second aggregation function calculates an aggregation result from aggregation results of the first aggregation function; and one or more conditions are applied to at least one of the aggregation functions and processes (d) to (g) are carried out with respect to the one or more conditions.
 19. A computer-implemented machine-learning method comprising: using one or more features determined and selected with the method of claim 1 to calculate aggregation results from sample data that is at least partially included in one or more peripheral tables; joining the calculated aggregation results to a population table which includes target values; and training a machine-learning algorithm based on the aggregation results and the target values in the population table.
 20. A system comprising: a computing device having a processor and a memory, the computing device configured to generate a feature for a machine-learning application by performing processes including: a) identifying a preliminary feature comprising at least one aggregation function that aggregates relational sample data or learnable weights applied to the sample data; b) calculating aggregation results with the aggregation function; c) determining, based upon an optimization criterion formula, a quality with which the aggregation results relate to target values; d) adjusting the preliminary feature by incrementally changing a condition applied to the aggregation function; e) calculating aggregation results of the adjusted preliminary feature by adjusting the aggregation results from a previous preliminary feature; f) determining a quality with which the aggregation results of the adjusted preliminary feature relate to the target values; and g) repeating processes (d) through (f) for a plurality of incremental changes, and selecting a feature that yields the best quality.
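
By way of non-limiting illustration only, processes (a) through (g) recited above can be sketched in Python as follows. The sketch rests on several assumptions that are not taken from the claims: the aggregation function is a SUM, the condition is a threshold of the form `condition_value >= threshold`, the optimization criterion is the squared Pearson correlation between the aggregation results and the target values, and the names `r_squared` and `search_threshold` are hypothetical. It is one possible reading, not a definitive implementation.

```python
from math import inf


def r_squared(xs, ys):
    """Assumed optimization criterion: squared Pearson correlation.

    Returns 0.0 when either sequence is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy * sxy / (sxx * syy)


def search_threshold(targets, peripheral):
    """Find the threshold whose conditional SUM best relates to the targets.

    targets:    target values, indexed by population-table row.
    peripheral: (population_row, condition_value, amount) tuples.
    """
    # Processes (a)-(c): the preliminary feature starts with a condition
    # that matches no sample data, so every aggregation result is 0.
    sums = [0.0] * len(targets)
    best_quality, best_threshold = r_squared(sums, targets), inf

    # Sorting in descending order means that lowering the threshold
    # only ever adds matching rows and never removes any.
    rows = sorted(peripheral, key=lambda row: row[1], reverse=True)
    i = 0
    while i < len(rows):
        # Process (d): incrementally change the condition.
        threshold = rows[i][1]
        # Process (e): adjust the previous aggregation results using only
        # the rows that newly satisfy the condition.
        while i < len(rows) and rows[i][1] == threshold:
            population_row, _, amount = rows[i]
            sums[population_row] += amount
            i += 1
        # Process (f): determine the quality of the adjusted feature.
        quality = r_squared(sums, targets)
        # Process (g): keep the best feature seen so far.
        if quality > best_quality:
            best_quality, best_threshold = quality, threshold
    return best_threshold, best_quality


if __name__ == "__main__":
    example_targets = [10.0, 0.0, 20.0, 5.0]
    example_peripheral = [
        (0, 5.0, 10.0), (0, 1.0, 100.0), (1, 2.0, 50.0), (2, 6.0, 20.0),
        (2, 0.5, 70.0), (3, 5.0, 5.0), (3, 1.5, 30.0),
    ]
    print(search_threshold(example_targets, example_peripheral))
```

In this sketch the peripheral rows are sorted once, and each incremental change touches only the rows whose match status changes, so the aggregation results are never recomputed from scratch. The quality, by contrast, is recomputed in full at every step; an update formula of the kind recited in claim 16 would be a further refinement that is not shown here.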